PII: S0022-1694(19)30677-8
DOI: https://doi.org/10.1016/j.jhydrol.2019.123957
Article Number: 123957
Reference: HYDROL 123957
Please cite this article as: Tyralis, H., Papacharalampous, G., Burnetas, A., Langousis, A., Hydrological post-
processing using stacked generalization of quantile regression algorithms: Large-scale application over CONUS,
Journal of Hydrology (2019), doi: https://doi.org/10.1016/j.jhydrol.2019.123957
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers
we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and
review of the resulting proof before it is published in its final form. Please note that during the production process
errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Hydrological post-processing using stacked generalization of quantile
regression algorithms: Large-scale application over CONUS
(temperature, precipitation etc.; see e.g. Lidén and Harlin 2000, Mouelhi et al. 2006a, b,
Das et al. 2008, Kaleris and Langousis 2017). In this context, hydrological models can be
classified into three broad categories, i.e. physically based, conceptual, and data-driven (see e.g. Solomatine and Wagener 2011). The output of the physically based and conceptual models consists of point predictions of hydrologic quantities, which do not allow for
direct quantification of predictive uncertainties. To account for the latter, within the
general framework of probabilistic prediction (see e.g. Krzysztofowicz and Kelly 2000,
Krzysztofowicz 2001, 2002, Kavetski et al. 2002, Montanari and Brath 2004, Kuczera et
al. 2006, Todini 2007, Montanari and Grossi 2008, Weijs et al. 2010, Montanari and
2017), one needs to estimate the probability distribution function (PDF) of the
predictand variable (or the joint probability distribution function of all predictand
2017).
processing, the reader is referred to Li et al. (2017). Examples of relevant algorithms
(a) Quantile regression (see e.g. Koenker and Bassett Jr 1978, and Koenker 2005 on
frequency analysis in hydrology, and Weerts et al. 2011, López López et al. 2014, Dogulu
(b) Quantile regression neural networks (QRNN, where Artificial Neural Networks
are used to quantify the relationship between predictor variables and conditional
quantiles of dependent variables, see e.g. Taylor 2000, and Bogner et al. 2016 for an
Within the broader class of regression schemes, one can also consider:
(a) Autoregressive models with exogenous variables (ARX, see e.g. Reinsel 1979,
Hannan et al. 1980, Box et al. 2015, and Seo et al. 2006 for an application).
(b) Vector autoregressive models with exogenous variables (VARX, see e.g. Hannan et
al. 1980 on the methodological framework, and Bogner and Pappenberger 2011 for an
application).
(c) Use of ensemble Kalman filtering techniques (see e.g. Kalman 1960, Evensen 1994,
dependent variables are modelled using regression algorithms, see e.g. Rigby and
Stasinopoulos 2005, and Yan et al. 2014 for an application on river storage forecasts).
2014, p. 487) have started gaining prominence. These include Bayesian Model Averaging
(BMA, see e.g. Min and Zellner 1993, Raftery et al. 1997, 2005), non-homogenous
Gaussian regression (NGR, see e.g. Gneiting et al. 2005) and the beta-transformed linear
pool (BLP, see e.g. Ranjan and Gneiting 2010, Gneiting and Ranjan 2013), among others;
see e.g. the reviews in Bogner et al. (2017), Baran and Lerch (2018) and Wang et al.
(2019).
Most regression models belong to the families of Statistical Learning (SL, see e.g.
Hastie et al. 2009; James et al. 2013) or Machine Learning (ML) algorithms, with the
distinction between the two terms being primarily a matter of scientific debate (see e.g.
Bzdok et al. 2018). For brevity, in what follows, we use the term machine learning (ML)
for the algorithms and general methodological framework, and skip the alternative term.
Machine learning algorithms belong to the class of nonparametric methods and, thus, do not provide explicit expressions for the PDFs of the obtained forecasts. The latter need to
model used, to be properly combined using methods such as BMA, BLP, NGR etc. (see
above).
expressions for the PDFs of the base-learners, Wang et al. (2019) proposed the
forecasts and predict electricity demand. CQRA is based on the minimization of the
quantile score (QS, see e.g. Koenker and Machado 1999, Friederichs and Hense 2007,
Bentzien and Friederichs 2014, referred to as pinball loss in Wang et al. 2019) over all
targeted quantiles and forecast horizons, using linear programming to estimate optimal
weights for all individual probabilistic forecasts. The method is capable of combining
forms (e.g. as in Tyralis and Koutsoyiannis 2014). Note that QS has been consistently
predictand variables (Bogner et al. 2016, 2017), as well as the quality (reliability,
probabilistic hydrological forecasts in the absence of explicit expressions for the PDFs of
the method is based on the minimization of the interval score (IS, also referred to as
Winkler score, Gneiting and Raftery 2007) and combines base-learners using stacked
generalization (stacking, Wolpert 1992), following the CQRA method. Stacking focuses
on the performance of the combination of the algorithms, in contrast to Bayesian Model Averaging, which is widely used in hydrology but may produce largely inaccurate results, as shown by Yao et al. (2018). Furthermore, it has been suggested that combining quantile forecasts (as e.g. in the CQRA method) should be preferred to
et al. 2013).
We introduce the method with the aim to improve probabilistic predictions when
Attributes and MEteorology for Large-sample Studies) dataset. Two experiments are
conducted in the 511 basins, i.e. (a) one-step-ahead prediction (see e.g. Evin et al. 2014)
scale (see e.g. the review in Beck et al. 2017) and, therefore, it can effectively serve for
validation of the introduced method. Large-scale assessments are increasingly used in
hydrological modelling and forecasting (see e.g. Perrin et al. 2001, Mouelhi et al. 2006a,
b, Bourgin et al. 2015, Langousis et al. 2016, Beck et al. 2017, Tyralis and
2019a, c, Tyralis et al. 2018a, b, Xu et al. 2018), as their results are more general than
those of case studies, while only a few large-scale studies currently appear in the literature
In Sections 2 and 3.3, we introduce the proposed general framework and its technical
hydrological post-processing for 511 basins (as outlined above), and illustrate its
improved performance relative to the base-learners used. Sections 4.1 and 5 discuss the
obtained results, as well as general concepts regarding the application of the method.
2. Methods
The definitions and nomenclature for the variables, sets, and methods used hereafter,
introduced by Wolpert (1992), where the base-learners are combined using another
learner, usually referred to as the combiner learner (see e.g. Alpaydin 2014, p. 504). A
note to be made here is that ensemble learning of ML algorithms should not be confused
with the general concept of ensemble forecasting in hydrology, which implies that the
estimation variance of hydrological quantities can be obtained from the spread of the
ensemble member forecasts originating from different hydrological models (see e.g.
Gneiting et al. 2005). In the context of probabilistic forecasts, ensemble learning stands
for the use of multiple ML algorithms to obtain individual probabilistic forecasts, and
intervals. For example, the CQRA method (Wang et al. 2019) relies on weighted
The base-learners used herein are quantile regression (QR) and quantile regression
forests (QRF, Meinshausen 2006); see Section 2.5 for details. QRF is based on random
forests (RF, Breiman 2001), and it has been used for hydrometeorological post-
e.g. Bhuiyan et al. 2018). Here QRF is introduced in the context of hydrological post-
processing. For combiner learner we use the weighted sum of the predictive quantiles,
processing streamflow simulations. The latter are obtained via the GR4J (Génie Rural à 4 paramètres Journalier) hydrological model of Perrin et al. (2003). Other hydrological models can also be used; however, our focus here is on the
(a) Experiment 1: One-step ahead predictions (as e.g. in Evin et al. 2014), where at
each time step of the prediction period, the base-learners use observed streamflow
information from the previous day, and the same-day hydrological model output.
(b) Experiment 2: Predictions where, at each time step, the base-learners use hydrological model outputs for the current and two previous days.
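In code form, the predictor vectors of the two experiments could be assembled as follows (an illustrative Python sketch; the paper's computations are in R, and the function and variable names here are ours). The definitions match the predictor variables xt = (yt – 1, vt) and xt = (vt, vt – 1, vt – 2) given later in the text:

```python
def predictors(y, v, t, experiment):
    """Predictor vector x_t at (0-based) time step t, where y holds observed
    and v simulated daily streamflows.
      Experiment 1: x_t = (y_{t-1}, v_t)        -- previous-day observation
                                                    plus same-day simulation
      Experiment 2: x_t = (v_t, v_{t-1}, v_{t-2})  -- simulations only"""
    if experiment == 1:
        return (y[t - 1], v[t])
    return (v[t], v[t - 1], v[t - 2])
```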
variables for the base-learners (as e.g. Ye et al. 2014), and/or used to obtain
We run the calibrated hydrological model in simulation mode; i.e. we obtain the
(Klemeš 1986; see e.g. Vrugt and Robinson 2007, Montanari and Grossi 2008, Zhao et al.
2011, Evin et al. 2014, Ye et al. 2014, Dogulu et al. 2015). In this way, we assess the
possible influences imposed by the accuracy of weather forecasts. For the proposed
methodology to be used for forecasting purposes, one needs to run the hydrological
model in forecast mode; i.e. to use temperature and precipitation forecasts, instead of
recorded quantities (Klemeš 1986). In this case, the PDFs of the predictand variables are
predictions. The latter is imposed by the intrinsically uncertain character of the weather
the post-processor assuming no uncertainty in the inputs, and then combine input
uncertainty and post-processing (see e.g. Krzysztofowicz 1999, Pagano et al. 2013).
In this Section, we present the general framework of the proposed methodology. Brief
descriptions of its specific components are given in Section 2.5. We define the interval
score of base-learner n at time t for a prediction interval 1 – a, 0 < a < 1, as (Gneiting and
Raftery 2007):
Ln,t,a(yn,t,a/2, yn,t,(1 – a/2), yt) := (yn,t,(1 – a/2) – yn,t,a/2) + (2/a) (yn,t,a/2 – yt) 1(yt < yn,t,a/2) + (2/a) (yt – yn,t,(1 – a/2)) 1(yt > yn,t,(1 – a/2)) (1)
IS is a proper scoring rule to assess the properties of prediction intervals (see e.g.
Gneiting and Raftery 2007), which traces back to Dunsmore (1968) and Winkler (1972)
(see e.g. Gneiting and Raftery 2007) and has been used to assess the quality of
hydrometeorological forecasts (see e.g. Hamill and Wilks 1995) and hydrological
predictions (see e.g. Bock et al. 2018, Papacharalampous et al. 2019b, c). The reliability
score, which is related to IS, has been used to assess the performance of algorithms for
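In code, the interval score of eq. (1) for a single observation can be computed as follows (an illustrative Python sketch; the study's own computations are in R):

```python
def interval_score(lower, upper, y, a):
    """Interval score (Gneiting and Raftery 2007) of a central (1 - a)
    prediction interval [lower, upper] for observation y.
    Lower values indicate better prediction intervals."""
    width = upper - lower
    if y < lower:                       # observation below the interval
        return width + (2.0 / a) * (lower - y)
    if y > upper:                       # observation above the interval
        return width + (2.0 / a) * (y - upper)
    return width                        # observation inside: width only

# Example: a 90% interval (a = 0.1)
interval_score(2.0, 5.0, 4.0, 0.1)  # inside: score = width = 3.0
interval_score(2.0, 5.0, 6.0, 0.1)  # outside: 3.0 + (2/0.1)*1.0 = 23.0
```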
Also, let t ∈ {1, …, n1 + n2 + n3}, where the period T with available observations has
been divided into three consecutive subperiods T1, T2, and T3 containing n1, n2 and n3
values, respectively. The stacked algorithm is trained in the period {T1, T2}, whereas
period T3 (i.e. an independent period with data not used for training) is used to test the
stacked algorithm. In what follows, we outline the algorithmic steps used to combine the
probabilistic predictions for a specific prediction interval 1 – a (see also Figure 1 for an
illustration):
Step 1 (Train the base-learners in subperiod T1): Each of the n base-learners fn,q(∙), q ∈ {a/2, 1 – a/2}, is trained independently in subperiod T1.
Step 2 (Use the base learners to obtain predictions in subperiod T2): The trained
base-learners of step 1 are used to predict yn,t,q ∀ t ∈ T2, n ∈ N, q ∈ {a/2, 1 – a/2}, where
xt are used as predictor variables of the trained base-learners; i.e. yn,t,q = fn,q(xt).
Step 3 (Stacked generalization): The quantity ∑t Lt,a(yt,a/2, yt,(1 – a/2), yt) is minimized in subperiod T2, where yt,q = ∑n wn,a yn,t,q, q ∈ {a/2, 1 – a/2}, subject to the constraints ∑n wn,a = 1 and wn,a ∈ [0, 1], n ∈ N. The aim is to obtain proper weights wn,a that minimize the total loss over different times t; i.e. ∑t Lt,a(yt,a/2, yt,(1 – a/2), yt).
Step 4 (Retrain the base-learners using the whole training period {T1, T2}): Each of the
n base-learners fn,q(∙), q ∈ {a/2, 1 – a/2}, is trained independently again in the period {T1, T2}.
Step 5 (Obtain predictions in test period T3): The predictive quantile yt,q, q ∈ {a/2, 1 – a/2}, at time t ∈ T3 for a given predictor variable xt, is calculated as yt,q = fe,q(xt), where fe,q denotes the weighted sum (with weights estimated in Step 3) of the quantiles
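Step 3 amounts to a constrained minimization of the total interval score. For n = 2 base-learners (the case used later in this study), the constraints ∑n wn,a = 1 and wn,a ∈ [0, 1] leave a single free weight, so the optimum can be approximated by a dense grid search. A Python sketch (illustrative only; function and variable names are ours, not from the paper):

```python
def interval_score(lower, upper, y, a):
    """Interval score of eq. (1), repeated here so the sketch is self-contained."""
    width = upper - lower
    if y < lower:
        return width + (2.0 / a) * (lower - y)
    if y > upper:
        return width + (2.0 / a) * (y - upper)
    return width

def stack_two_learners(q_lo_1, q_hi_1, q_lo_2, q_hi_2, y_obs, a, n_grid=1001):
    """Step 3 for n = 2 base-learners: find the weight w (learner 1 gets w,
    learner 2 gets 1 - w) minimizing the total interval score of the
    weighted predictive quantiles over subperiod T2.
    q_lo_k / q_hi_k: a/2 and 1 - a/2 quantiles of learner k in T2."""
    best_w, best_loss = 0.0, float("inf")
    for i in range(n_grid):
        w = i / (n_grid - 1)
        loss = sum(
            interval_score(w * l1 + (1 - w) * l2,
                           w * u1 + (1 - w) * u2, y, a)
            for l1, u1, l2, u2, y in zip(q_lo_1, q_hi_1, q_lo_2, q_hi_2, y_obs)
        )
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w, best_loss
```

With more than two base-learners, the same minimization would instead call for a linear-programming or general constrained-optimization solver, as in the CQRA method.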
Figure 1. Illustration of the steps of the proposed algorithm. Green horizontal lines refer
to the periods of the data used.
approaches. In the first approach (termed ensemble learning method 1), for each value of a, steps 1–5 are applied, leading to weight combinations for the base-learners that differ for each prediction interval 1 – a. In the second approach (termed ensemble learning method 2), step 3 is modified to minimize the quantity ∑a ∑t Lt,a(yt,a/2,
yt,(1 – a/2), yt) (i.e. the total loss over several prediction intervals and times), instead of ∑t
Lt,a(yt,a/2, yt,(1 – a/2), yt). Hence, in the second approach, the obtained weight combinations
for the base-learners are invariant with respect to the prediction interval 1 – a, i.e. wn
(a) The simple averaging approach, which assigns equal weights (i.e. in our case ½ for
benchmark, as it corresponds to “an equally weighted opinion pool that is hard to beat in
practice” (see e.g. Lichtendahl Jr. et al. 2013). Simple averaging has been exploited in
quantile predictions (on the order of hundreds) obtained using simulations from a single
(b) Ensemble learners 3 and 4, which correspond to ensemble learners 1 and 2 (see
previous Section) respectively, with the difference that Step 4 of the algorithm (i.e.
retraining of the base learners in period {T1, T2}) is omitted. Thus, prediction is made
using the trained base learners of Step 1. This comparison allows quantification of the
information gain when retraining the base learners in a longer period (i.e. {T1, T2}).
(c) The QR and QRF base learners used to form the ensemble learners.
The proposed algorithm borrows concepts from the fields of hydrology, machine
learning, and statistics. The first basic concept, originating from the field of statistics, is
use of the interval score (IS) defined in eq. (1). Use of IS is substantiated by theoretical
arguments (see e.g. Gneiting and Raftery 2007), with lower values indicating better
component (yn,t,(1 – a/2) – yn,t,a/2) in eq. (1), as well as intervals that do not contain
observations (i.e. through the component of eq. (1) that remains, after subtraction of the
interval width). The latter penalty (hereafter referred to simply as penalty) increases
with the distance of the observations outside the prediction interval and, although more
general, it is implicitly linked to the reliability score (RS), which is defined here for base-
learner n as:
An optimal RS should equal 1 – a; i.e. a fraction 1 – a of the observed values should
averaging the implemented scores over a fixed set of forecasts (see e.g. Gneiting and
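Read as the empirical coverage of the prediction intervals (our interpretation of the verbal description above; the formal definition of RS is not reproduced in this copy), the reliability score can be computed as:

```python
def reliability_score(lower, upper, y_obs):
    """Fraction of observations falling inside their prediction intervals;
    for a well-calibrated central (1 - a) interval this should be
    close to 1 - a."""
    hits = sum(1 for l, u, y in zip(lower, upper, y_obs) if l <= y <= u)
    return hits / len(y_obs)
```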
stacked generalization (see Step 3 above). Stacked generalization (or stacking) is a type
of ensemble learning introduced by Wolpert (1992) (see e.g. Alpaydin 2014, p. 504 for a
“improve” the predictions of the base learners (see e.g. Breiman 1996a, Smyth and
Wolpert 1999), with the latter being used as input. Under this setting, the base learners
and the combiner learner need to be trained over different sets. Here, this is achieved by
splitting the training period into two subperiods T1 and T2. Simultaneous fitting of the
ML algorithms (i.e. Step 1 above) and estimation of the weights (i.e. Steps 2 and 3 above)
using the whole {T1, T2} period is generally not recommended, as ML algorithms tend to overfit the training set, resulting in reduced predictive performance on independent test sets. This has also been verified in the context of the present study,
Other ensemble learning methods also exist; see e.g. the review by Sagi and Rokach
(2018). Two of the most widely used are bagging (Breiman 1996b) and boosting
(Friedman 2001, see also the reviews by Natekin and Knoll 2013, and Mayr et al. 2014).
Bagging averages multiple weak learners (i.e. learners with low performance, or
unstable learners), while in boosting new weak base learners are progressively
introduced and trained to minimize the error of the ensemble learner following an
iterative procedure. Thus, new models are progressively added to the ensemble. Instead,
The overall performance of the ensemble learner (formed by the combiner learner
and the base learners) depends on the ability of the combiner learner to properly weigh the base learners within a given test set, which in turn depends on the effectiveness of its calibration in the training period T2, as well as on potential similarities between periods T1
and T2. The splitting problem of a set into training and validation periods is common to
all areas of hydrology and machine learning, and addressing it goes beyond the scope of
the present study. Here, the training set is partitioned into two subperiods (i.e. T1 and
T2) of almost equal lengths (i.e. 8 and 6 years, respectively), whereas the test set (i.e.
subperiod T3) includes 30% of the available data (i.e. 6 out of 20 years); see Section 3.2
for details. Similar relative lengths for the corresponding training and test periods have
been used in other ML studies; see e.g. Antal et al. (2003), Yu and Xu (2008) and Papacharalampous et al. (2019a). The overall results can be considered reliable as the
length of the available data allows examination of various patterns of low and high
The proposed algorithm borrows concepts from Wang et al. (2019) and Trapero et al.
(2019), who used QS as a loss function, and Yao et al. (2018) who combined closed
expressions of probabilistic forecasts. In the former two studies, the weights of the base-
learners were estimated by minimizing the QS across all targeted quantiles and forecast
horizons. Here, we are interested in estimating optimal prediction intervals (i.e. pairs of predictive quantiles), for which minimization of IS is more suitable than minimization of QS. The latter would lead to doubling the number of the
applied weights (i.e. one weight per bound in QS, vs. one weight per interval in IS), thus
increasing the uncertainty of the resulting predictions. We also note that existing
other methods (e.g., BMA, BLP, NGR; see Introduction), are that: a) the weight search can
and computational efficiency of the algorithm, and b) quantile crossing issues are
minimized, as the obtained weights do not depend on posterior distributions that may
generalization relative to BMA, the reader is referred to Wang et al. 2019). Further
advantages of the method are inherited from the properties of stacked generalization, which is
a general methodology with deep theoretical background (for details see Wolpert 1992),
and the fact that the method is simple, straightforward to use, computationally efficient
(i.e. it takes approximately 45 min to process 511 basins with 30 years of data each, including hydrological model simulations, on a regular PC), and practical due to its full
2.5 Base-learners
General guidelines for the selection of base-learners are presented in Alpaydin (2014,
pp. 488–491). In brief, the base-learners should be simple, accurate, and diverse, so they
complement each other. Here we use QR and QRF as base-learners, but the method can
combine more than two quantile regression base-learners. The ensemble learner can
also include different base-learners, which originate from the same ML algorithm (e.g.
quantile regression algorithms detailing recent progress in the field can be found in
the scope of the present study, brief descriptions of the methods and software packages
Linear-in-parameters quantile regression (QR) was introduced by Koenker and Bassett (1978), while an extended treatment of the method can be found in Koenker (2005). The method models conditional quantiles of the response variable, whereas linear regression considers the conditional mean of the response variable. An intuitive explanation of QR is that it fits a linear model that splits the data so that approximately 100q% of the observations lie below the predicted values of the fitted model. Practically, this is done by
fitting a linear model to the data and minimizing the average QS. The method is suitable
for modelling heteroscedasticity (Koenker 2005, p. 25). We apply the method using the
rq R function of the quantreg R package (Koenker 2018), which implements the fitting
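The average QS minimized when fitting QR can be illustrated in the simplest constant-model case, where the minimizer is an empirical quantile of the data (a Python sketch; the paper instead fits linear models with the rq function of the quantreg R package):

```python
def quantile_score(pred, y, q):
    """Quantile (pinball) score for quantile level q; lower is better."""
    return (q - (1 if y < pred else 0)) * (y - pred)

def fit_constant_quantile(ys, q, candidates=None):
    """Fit a constant model by minimizing the average quantile score over
    candidate values; the minimizer coincides with an empirical
    q-quantile of the data."""
    candidates = candidates if candidates is not None else ys
    return min(candidates,
               key=lambda c: sum(quantile_score(c, y, q) for y in ys))
```

For instance, fitting the 0.5 level recovers a sample median, while levels a/2 and 1 – a/2 give the bounds of a central (1 – a) prediction interval.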
2.5.2 Quantile regression forests
algorithm is based on random forests (RF, Breiman 2001, see also Biau and Scornet
2016), with interest being in conditional quantiles, rather than the conditional mean. RF is a
et al. 2018, Papacharalampous and Tyralis 2018), point time series forecasting in
quantities (see e.g. Tyralis et al. 2018, 2019b). An extensive review on the use of RF in
water sciences can be found in Tyralis et al. (2019a), and a detailed description of the
1996b) regression trees. In addition to bagging, the splitting at the nodes of the
functions of the events exhibiting decision tree outcomes in the test set, lower than a
Here, we apply QRF using the quantile_forest R function of the grf R package
(Tibshirani et al. 2018), which emulates Meinshausen’s (2006) algorithm (see also Athey
et al. 2019). The corresponding algorithm is straightforward and very simple to use,
with a few parameters to tune, while the default values in the software implementation
are near optimal (see e.g. the discussion in Verikas et al. 2011, Oshiro et al. 2012,
Scornet et al. 2015, Biau and Scornet 2016, Probst and Boulesteix 2018, Tyralis et al.
interest is in the relative improvement of the combiner learner with respect to the base-
learners used. Other properties of random forests are that they demonstrate high
predictive performance, they are non-linear and non-parametric, they are fast compared
to other machine learning algorithms, and they are stable and robust to the inclusion of
noisy predictor variables, while they do not extrapolate outside the training range
within the test set (see e.g. Biau and Scornet 2016, Tyralis et al. 2019a).
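The no-extrapolation property mentioned above follows from the fact that QRF reads quantiles off training responses pooled over tree leaves. A deliberately oversimplified toy in Python (single-split "trees"; nothing like the actual grf implementation used in the paper) illustrates the mechanism:

```python
import random

def fit_toy_qrf(xs, ys, n_trees=25, seed=42):
    """Toy quantile 'forest': each 'tree' is a single random split of the
    predictor axis; conditional quantiles are read from the training
    responses pooled over the matching leaves, so predictions can never
    leave the range of the training targets."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        t = rng.choice(xs)  # random split point
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if left and right:  # keep only splits with two non-empty leaves
            trees.append((t, left, right))

    def predict_quantile(x, q):
        pooled = sorted(y for t, left, right in trees
                        for y in (left if x <= t else right))
        return pooled[int(q * (len(pooled) - 1))]

    return predict_quantile
```

Even for a predictor value far outside the training range, the returned quantile is always one of the training responses.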
3.1 Data
A detailed presentation of the CAMELS dataset, used in the present study, can be found in
Addor et al. (2017a, b), Newman et al. (2014, 2015, 2017) and Thornton et al. (2014).
The dataset comprises daily hydrometeorological and streamflow data from 671
small- to medium-sized basins in CONUS. For each basin, the daily minimum and
maximum temperatures and precipitation have been obtained by processing the daily
dataset of Thornton et al. (2014). Changes in the basins due to human influences are
acceptable option; see e.g. Solomatine and Wagener (2011) regarding the requirements
of stationarity when changes cannot be explained deductively. Here we focus on the 34-
year period 1980-2013, and exclude basins with missing data or other inconsistencies.
The final sample consists of 511 basins representing most climate types over CONUS;
see Figure 2.
For each of the 511 basins, we estimate the mean daily temperature as the average of
the respective minimum and maximum daily temperatures. The daily potential evapotranspiration (PET) is estimated following Oudin et al. (2005). For the latter, we use the PEdaily_Oudin R function of the airGR R package
(for details see Coron et al. 2017, 2018), with the daily mean temperature as input.
The GR4J model constitutes an improvement of the GR3J (Génie Rural à 3 paramètres
Journalier) model by Edijatno et al. (1999), and comprises four parameters, while its precursor (i.e. GR3J) comprises three parameters (Perrin et al. 2003). The use of this
small number of parameters is fully justified in Perrin et al. (2001). The hydrological
model is herein calibrated in a non-adaptive way; i.e. the calibration is performed once
for each basin and the hydrological model is thereafter applied with fixed parameter
values (see e.g. Toth et al. 1999). Although feasible, we do not perform adaptive
calibration (see e.g. Brath and Rosso 1993, Ye et al. 2014), as its benefits are delivered
et al. 1999).
We use the airGR R package to apply the GR4J hydrological model to each basin. We
simulate daily streamflow with recorded daily precipitation and PET as input. The
period 1980-1981 is used to warm up the hydrological model, while period 1982-1993
algorithm using the Nash–Sutcliffe criterion (Nash and Sutcliffe 1970), to characterize
Following the notation presented in Section 2.2, we define the periods: T1 = {1994-01-
31}, and use the calibrated hydrological model to simulate daily streamflows for the
total period T = {T1, T2, T3}. The simulated streamflow vt at time t is calculated using
information until day 1993-12-31 for yt (i.e. the recorded streamflow), and until day t for
prt and pett (i.e. precipitation and potential evapotranspiration, respectively). The final
with a total of 1 120 112 simulated values in period T3, where the ensemble learner is
tested (i.e. 2 192 simulated streamflow values for each of the 511 basins).
ahead predictions; see Section 2.1) the predictor variable is defined as xt = (yt – 1, vt). Use
examples (see e.g. Krzysztofowicz 1987, Seo et al. 2006, Evin et al. 2014, Bogner et al.
observations. In Experiment 2, the predictor variable is defined as xt = (vt, vt – 1, vt – 2) and
transformation, with the aim to increase the performance of the model. Appropriate
transformations can be applied to both yt and vt. Several options are available in the
existing literature, such as the arcsinh(∙), log(∙), square root, Box-Cox, and Yeo-Johnson
training sets, and all ML calculations should be performed using transformed quantities.
The inverse transformation is then applied to the predicted quantiles. We tried all
previously mentioned transformations, and found that the square root transformation
was the only one not resulting in unrealistically high quantiles by the QR algorithm in
2018), QR is more robust and less sensitive to the existence of outliers of the dependent
(Díaz-Uriarte and De Andres 2006). The square root transformation has also been used
include heteroscedastic behaviour of the data, censoring (i.e. in case the predicted
effectively model heteroscedastic behaviour, such as the QR used in the present study.
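The transformation workflow (transform predictand and predictors, train in transformed space, back-transform the predicted quantiles) can be sketched for the square-root case as follows; since the square root is monotone, quantiles in the transformed space map directly to quantiles of the original variable (an illustrative Python sketch; function names are ours):

```python
import math

def to_sqrt_space(values):
    """Forward square-root transformation applied to streamflow series
    (both predictand yt and predictor vt) before ML training."""
    return [math.sqrt(x) for x in values]

def from_sqrt_space(quantiles):
    """Inverse transformation applied to the predicted quantiles; valid
    because the square root is monotone, so quantiles map to quantiles."""
    return [z * z for z in quantiles]

# Round trip: back-transformed quantities match the original flows
flows = [0.0, 1.0, 4.0, 9.0]
assert from_sqrt_space(to_sqrt_space(flows)) == flows
```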
Problems of negative quantiles were minimal in the present application. In the case of
QRF base-learners, negative values are by definition not possible, as the predicted
quantiles constitute subsets of the values found in the training set. For the QR base-
learners, the problem of negative quantiles was addressed by censoring them. Quantile
crossing problems were also minimal in the present application, and have been
Wang et al. (2019). According to the latter, if the predicted quantile at level q1 turns out to be larger than the predicted quantile at level q2, with q1 < q2, then the latter is set equal to the former.
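The two fixes just described (censoring negative QR quantiles at zero, and resolving quantile crossing as in Wang et al. 2019) can be sketched as follows (illustrative Python; names are ours):

```python
def censor_negative(quantiles):
    """Censor negative predicted streamflow quantiles at zero."""
    return [max(0.0, q) for q in quantiles]

def fix_crossing(levels_and_quantiles):
    """Resolve quantile crossing: scanning in increasing quantile level,
    if the prediction at a higher level falls below the one at a lower
    level, the higher-level prediction is set to the lower-level one."""
    fixed = []
    running_max = float("-inf")
    for level, q in sorted(levels_and_quantiles):
        running_max = max(running_max, q)  # propagate lower-level value upward
        fixed.append((level, running_max))
    return fixed
```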
For period T3, Figure 3 summarizes information on the simulated and observed
streamflows for all basins analysed. Figure 3.a presents a scatterplot for the same- and
respectively. Regarding Figure 3.b, one sees that the linear regression line (red) between
the same-day hydrological simulations and observations is close to the 45-degree line
(black), indicating that the hydrological model pre-processes the data relatively well.
However, there seems to be a moderate negative bias in the estimation of high flows, as
indicated by the points lying above the 45-degree line. Also, as physically expected, the
deviation between observed and simulated flows increases with increasing lag-times;
Figure 3. Scatterplots of yt versus: (a) yt–1, (b) vt, (c) vt–1, and (d) vt–2 for all basins, and t
in the T3 period. The 45-degree line (black) and the linear regression line (red) between
the variables of the two axes are also presented.
two sequential days (i.e. yt, and yt – 1), indicating the appropriateness of using xt = (yt – 1,
linear regression line in Figure 3.a from the 45-degree line is larger than that in Figure
3.b, indicating the important pre-processing role of the hydrological model. Regarding
high flows, the respective points in Figure 3.a are scattered symmetrically around the
The appropriateness of using xt = (yt – 1, vt) as a predictor variable in hydrological
(obtained for validation period T3 and all considered basins) between yt and (a) yt–1, (b)
vt, (c) vt–1, and (d) vt–2. One sees that the correlations between yt, and each of the
variables yt–1 and vt are generally higher relative to the correlations between the
observed streamflow yt at time t, and the simulated streamflows at earlier times (i.e. vt–1
and vt–2). Correlation histograms obtained for validation period {T1, T2} are similar to
of experiments 1 and 2 at an arbitrary basin. The 0.025 and 0.975 quantiles of the base-
learners QR and QRF, and ensemble learners 1 and 2 are also presented. Visual
inspection of the post-processed simulations indicates that QR, QRF, and ensemble
learners 1 and 2 produce intervals that, in general, include yt. In experiment 2, the
prediction intervals are wider, due to the larger degree of uncertainty induced by the
absence of the previous-day observed streamflow yt–1 as predictor variable. In the next
For brevity, and without loss of generality, in what follows we centre the discussion on
performances relative to experiment 1. For all basins analysed, we assess the predictive
performance of ensemble learners 1–4 and the simple averaging method in period T3.
The assessment is made by estimating the relative improvement (RI) introduced with
respect to each of the base-learners. For instance, the RI of the interval score of learner i with respect to the nth base-learner (used for benchmarking) is
defined as:
Similarly, by substituting the interval score with its components, i.e. interval width and penalty (see Section 2.4.1), one can obtain their relative improvements as well; see
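The relative improvement of a negatively oriented score can be computed as the fractional reduction in the benchmark's score (our reading of the definition; the formula itself is not reproduced in this copy of the text):

```python
def relative_improvement(score_learner, score_benchmark):
    """Relative improvement of a learner over a benchmark for a negatively
    oriented score (lower is better), e.g. the interval score:
    positive values mean the learner beats the benchmark."""
    return (score_benchmark - score_learner) / score_benchmark

relative_improvement(9.0, 10.0)  # 10% improvement -> 0.1
```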
Regarding experiment 1, Figure 6.a shows the mean RI (over all basins) of ensemble
learners 1–4 and simple averaging with respect to QR, for different prediction intervals
1 – a = 20, 40, 60, 80, 90, 95%. Figure 6.b presents similar results to Figure 6.a, but for
experiment 2.
Figure 6. Mean relative improvement (over all basins) of the interval score (IS) with
respect to QR in: (a) experiment 1, and (b) experiment 2, for different prediction
intervals 1 – a = 20, 40, 60, 80, 90, 95%.
A positive value of RI indicates that the examined learner improves over the
benchmark learner. Values equal to 0 indicate that the examined and benchmark
learners 1 and 2 improve more than 10% at prediction intervals below 80%, while the
QRF, the relative improvement is 1-2% at low prediction intervals, and increases to
intervals can be used to predict low and high flows. The diverse properties of the two
base-learners with respect to the magnitude of the prediction interval are also
probable reason for this is that, by construction, QRF cannot predict beyond the range of
observed flows in the training set, whereas the QR algorithm is regression based
approximately 2% at prediction intervals below 80%, with the two methods sharing
Regarding experiment 2 (see Figure 6.b), one sees that the RI curves are shifted
downwards relative to Figure 6.a, indicating lower overall performances associated with
the larger degree of uncertainty (relative to experiment 1) induced by the absence of the
learners 1 and 2 perform better than the base-learners, simple averaging performs as well as both ensemble learners 1 and 2 at all prediction intervals. This important
result indicates that the outcome of optimal weight selection is strongly influenced by
the uncertainty of the predictor-predictand relationship. More precisely, as the level of
may not lead to significant improvements relative to simple averaging; i.e. a uniform
When averaged over all prediction intervals, the relative improvement of the interval
score of ensemble learner 1 in experiment 1 is 8.84% with respect to QR, and 4.43% with respect to QRF. The corresponding improvements introduced by ensemble learner 2 are 8.55% with respect to QR and 4.18% with respect to QRF, and by simple averaging
are 7.90% and 3.60%, respectively. The slight improvement of ensemble learner 1 in
overfitting. Clearly, the two ensemble learners are able to exploit the diverse properties
performance relative to simple averaging by approximately 1%. Also, it follows from the
discussion above that the first ensemble learner is approximately 0.5% more efficient
than the second one. The reason for this is that the first learner uses a combiner
algorithm that allows for additional degrees of freedom, as the weights applied to the
base-learners may vary with the prediction interval 1 – a (see Section 2.2). Note that the
significant, especially due to the size of the test set (i.e. 511 time series, each one
Wang et al. (2019) indicate 4.39% average relative improvement of the quantile score
with respect to the three base-learners used, based on eight daily time series of
electricity consumption, each one consisting of four years of data. Although smaller (due
improvements of ensemble learners 1 and 2 over the base-learners are also observed in
experiment 2 (see Figure 6.b). In addition, both ensemble learners appear to be overall
equivalent to simple averaging, indicating that weight optimization does not lead to
Figure 7 presents histograms of the relative improvements of the IS (see eq. (3))
introduced by the two ensemble learners in experiment 1, for all considered basins,
relative to the two base-learners. Each histogram consists of 3 066 values, which
correspond to six values (i.e. one per prediction interval 1 – a = 20, 40, 60, 80, 90, 95%)
per basin. In all cases, the improvements are mostly positive and well dispersed,
indicating that the results presented in Figure 6 (i.e. the mean relative improvement of
each ensemble learner relative to the two base-learners) are not dominated by
Figure 7. Histograms of relative improvements in terms of IS, as computed for all basins
and prediction intervals in experiment 1, for ensemble learner 1 (left panels) and
ensemble learner 2 (right panels). The relative improvements with respect to quantile
regression (QR) are illustrated in the top panels, and with respect to quantile regression
forests (QRF) in the bottom panels.
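Eq. (3) itself is not reproduced in this excerpt; assuming it is the standard interval score of Gneiting and Raftery (2007), the quantities behind Figures 6 and 7 can be sketched in a few lines of numpy (function and variable names here are illustrative, not taken from the paper's code):

```python
import numpy as np

def interval_score(lower, upper, obs, a):
    """Interval score of Gneiting and Raftery (2007) for a central
    1 - a prediction interval [lower, upper]; lower values are better."""
    lower, upper, obs = map(np.asarray, (lower, upper, obs))
    width = upper - lower
    below = (2.0 / a) * (lower - obs) * (obs < lower)  # penalty when obs falls under the interval
    above = (2.0 / a) * (obs - upper) * (obs > upper)  # penalty when obs falls over the interval
    return float(np.mean(width + below + above))

def relative_improvement(is_learner, is_benchmark):
    """Positive values indicate that the examined learner improves
    over the benchmark learner (cf. the RI of Figures 6 and 7)."""
    return 1.0 - is_learner / is_benchmark
```

For instance, an observation of −0.1 against the 80% interval [0, 1] scores 1 + (2/0.2)·0.1 = 2.0, whereas any observation inside that interval scores only the width, 1.0.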
Figure 8 and Figure 9 present boxplots of the average interval scores (IS) in
experiment 1 and experiment 2, respectively, for period T3. One sees that: a)
in both experiments, ensemble learners 1-4 and simple averaging improve over the
thus confirming that yt–1 (i.e. used as predictor variable in experiment 1) is more
informative than vt–1 and vt–2 combined (i.e. used as predictor variables in experiment 2).
Figure 8. Notched boxplots of average interval scores for experiment 1 in period T3, for
different prediction intervals 1 – a = (a) 20, (b) 40, (c) 60, (d) 80, (e) 90, (f) 95%. The
lower and upper hinges of the boxes correspond to the first and third quartiles. Values
exceeding the third quartile by more than 1.5 times the interquartile range are
considered outliers (denoted by dots).
Figure 9. Notched boxplots of average interval scores for experiment 2 in period T3, for
different prediction intervals 1 – a = (a) 20, (b) 40, (c) 60, (d) 80, (e) 90, (f) 95%. The
lower and upper hinges of the boxes correspond to the first and third quartiles. Values
exceeding the third quartile by more than 1.5 times the interquartile range are
considered outliers (denoted by dots).
To gain further insight regarding the performance of each method in the testing period
T3, Figure 10 presents, for both experiments, the ensemble mean (over all basins) of the
absolute differences between the reliability scores (see Section 2.4.1) and the
Figure 10. Ensemble mean (over all basins) of the absolute differences between the
reliability scores and the corresponding nominal values, for prediction intervals 1 – a =
20, 40, 60, 80, 90, 95%, in: (a) experiment 1, and (b) experiment 2.
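The definition of the reliability score in Section 2.4.1 is likewise outside this excerpt; assuming it is the empirical coverage of the prediction interval, the quantity averaged in Figure 10 (the absolute difference between coverage and nominal level) can be sketched as follows (names are illustrative):

```python
import numpy as np

def coverage(lower, upper, obs):
    """Empirical coverage: fraction of observations inside the interval."""
    lower, upper, obs = map(np.asarray, (lower, upper, obs))
    return float(np.mean((obs >= lower) & (obs <= upper)))

def reliability_deviation(lower, upper, obs, a):
    """Absolute difference between empirical coverage and the nominal
    coverage 1 - a of the prediction interval; smaller is better."""
    return abs(coverage(lower, upper, obs) - (1.0 - a))
```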
One can see that, in experiment 1, QR performs better than QRF at prediction
intervals below 60%, whereas the performances are reversed at higher prediction
intervals. In experiment 2, QR performs better than QRF at all prediction intervals.
In both experiments, ensemble learners 3 and
4 demonstrate limited performance relative to ensemble learners 1 and 2, with the latter
two exhibiting similar performances to simple averaging, balancing those of QR and QRF
base-learners.
Figure 11 presents the median relative improvements (i.e. with respect to QR) in
terms of prediction interval widths. Median values are preferred over mean values to
avoid influences by very low (i.e. near-zero) prediction interval widths. While the
performances of all methods are comparable in experiment 2 (see Figure 11.b) due to
the higher level of uncertainty induced by the absence of the previous-day observed
Figure 11. Median relative improvement (over all basins) of interval widths with respect
to QR in: (a) experiment 1, and (b) experiment 2, for different prediction intervals 1 – a
= 20, 40, 60, 80, 90, 95%.
Figure 12 presents the ensemble mean (over all basins) of the relative improvements
(with respect to QR) of penalties associated with intervals that do not contain
observations (see Section 2.4.1). The general pattern is similar to that of interval scores
in Figure 6, indicating that penalties are an important contributor to the interval score.
Figure 12. Mean relative improvement (over all basins) of penalties with respect to QR
in: (a) experiment 1, and (b) experiment 2, for different prediction intervals 1 – a = 20,
40, 60, 80, 90, 95%.
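Figures 11 and 12 examine the two additive parts of the interval score separately: the interval width, and the penalty incurred when an observation falls outside the interval. Assuming the standard form of the score (Gneiting and Raftery 2007), this decomposition can be sketched as (names illustrative):

```python
import numpy as np

def is_components(lower, upper, obs, a):
    """Split the mean interval score into its mean width and mean penalty
    parts, so that mean interval score = width + penalty."""
    lower, upper, obs = map(np.asarray, (lower, upper, obs))
    width = float(np.mean(upper - lower))
    penalty = float(np.mean((2.0 / a) * ((lower - obs) * (obs < lower)
                                         + (obs - upper) * (obs > upper))))
    return width, penalty
```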
4.3 Weights
To gain insight on how the weights of the ensemble learners (see Section 2.2) are
affected by the performances of the base-learners, Figure 13 shows, for ensemble
learner 1, the weights assigned to the QR base-learner against the relative
improvement of the average interval score of QRF relative to QR, for different prediction
intervals 1 – a. As expected, one sees that, independent of the experiment (i.e.
experiment 1, Figure 13.a, or experiment 2, Figure 13.b), the weights assigned to the
QR base-learner tend to decrease
Figure 13. Scatterplots of the weights of the quantile regression algorithm exploited
through ensemble learner 1 in: (a) experiment 1, and (b) experiment 2, against the
relative improvement of the average interval score of QRF relative to QR in period T3.
Figure 14 presents histograms of the weights assigned to the QR base-learner for
varying prediction intervals 1 – a. When the prediction interval 1 – a increases, the
weights increase as well. This is expected, because the relative gain in performance of
QR over QRF, in terms of the interval score, increases at higher prediction intervals.
Spikes at the edges of the
Figure 14. Histograms of the weights of the quantile regression (QR) algorithm exploited
through ensemble learner 1 in experiment 1 for different prediction intervals 1 – a = (a)
20, (b) 40, (c) 60, (d) 80, (e) 90, (f) 95%.
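With only two base-learners, the stacking weights of ensemble learner 1 reduce to a single scalar w per prediction interval (w on the QR quantiles, 1 − w on the QRF quantiles). A minimal grid-search sketch of such per-interval weight selection — an illustrative stand-in for, not a reproduction of, the estimation scheme of Section 2.2 — could read:

```python
import numpy as np

def interval_score(lower, upper, obs, a):
    # interval score of Gneiting and Raftery (2007); lower is better
    width = upper - lower
    pen = (2.0 / a) * ((lower - obs) * (obs < lower) + (obs - upper) * (obs > upper))
    return np.mean(width + pen)

def best_weight(qr_lo, qr_hi, qrf_lo, qrf_hi, obs, a, grid=101):
    """Return the w in [0, 1] minimizing the interval score of the
    combined quantiles w*QR + (1 - w)*QRF for a given interval 1 - a."""
    ws = np.linspace(0.0, 1.0, grid)
    scores = [interval_score(w * qr_lo + (1 - w) * qrf_lo,
                             w * qr_hi + (1 - w) * qrf_hi, obs, a)
              for w in ws]
    return float(ws[int(np.argmin(scores))])
```

Simple averaging corresponds to fixing w = 0.5, and by construction the selected w can do no worse, on the data it is fitted to, than either pure base-learner (w = 0 or w = 1).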
5. Concluding remarks
predictions. The few existing methods require formal definition of the likelihoods of the
expressions for the PDFs of the obtained forecasts. In this study, we borrowed concepts
from Wang et al. (2019) to propose an ensemble learner, which uses stacked
learners (i.e. quantile regression and quantile regression forests algorithms), using
The method was tested using a large dataset consisting of 511 basins. The conducted
tests focused on delivering one-step ahead predictions (experiment 1), as well as on post-
that the ensemble learners improve over the performance of the best base-learner by 1-
5%, depending on the experiment and the prediction interval. The suggested method
was also found to outperform simple averaging (i.e. a uniform weighting scheme that
assigns equal weights to all base-learners), or to share first place with it in all
examined cases, with the maximum obtained improvement over this tough benchmark
The results are considered significant, especially given the length of the sample on
which the algorithm has been tested (i.e. post-processing of 1 120 112 hydrological predictions
from 511 time series) and the fact that simple averaging is hard to beat in practice (see
e.g. Lichtendahl Jr et al. 2013). The latter general observation indicates that when the
To the best of our knowledge, no similar study has been conducted in the
hydrological literature, with the closest work being that of Wang et al. (2019) in
electricity forecasting. The latter is based on minimization of the quantile score (QS, see
Introduction), indicating 4.39% average relative improvement with respect to the three
minimizing the interval score (IS), resulting, e.g. in experiment 1, in approximately 6.5%
average relative improvement over the two base-learners (i.e. (8.98% + 4.44% + 8.66% +
4.18%)/4; see Section 4.1). Also, note that application of the constrained quantile
regression averaging (CQRA) method of Wang et al. (2019) was based on eight daily
time series of electricity consumption, each one consisting of four years of data.
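As a quick arithmetic check, the approximately 6.5% figure quoted above is the plain average of the four improvement percentages:

```python
# average relative improvement of ensemble learners 1 and 2 over the
# two base-learners in experiment 1 (percentages quoted in the text)
improvements = [8.98, 4.44, 8.66, 4.18]
mean_improvement = sum(improvements) / len(improvements)  # 6.565, i.e. about 6.5%
```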
One should consider the convenience of using the proposed method over other
combination methods (e.g. Bayesian Model Averaging), as well as theoretical studies that
support: a) stacking against Bayesian Model Averaging, and b) working with quantile
forecasts instead of probability distributions (see also Section 1). The extended use of
learning algorithms are accurate, they have been tested extensively in practice as well as
in forecasting competitions, they are easy to apply due to their open software
implementations, and they are efficient in terms of computation times (the
computations of the present study, including fitting of the
large-scale implementations.
Future research could focus on defining optimal splitting points of the training set
used, the inclusion of more base-learners, and testing the method using forecasts of
daily temperature and
when metrics/scores other than IS (see e.g. Gneiting and Raftery 2007 and Shastri et al.
2017) are minimized to optimally combine probabilistic forecasts. Further uses of the
method are also possible, spanning from hydrological forecasting using data-driven
models, to water demand forecasting, to water science problems, and beyond; e.g. in
Conflicts of interest: We declare no conflict of interest.
Acknowledgements: We are grateful to the Editor, Associate Editor, and the reviewers
for their constructive comments and suggestions, which helped us to improve the
manuscript.
Appendix A Nomenclature
Indices
n Index of base-learners
q Index of quantiles
Sets
N Set of base-learners
Q Set of quantiles
Functions
Variables
Ln,t,a Interval score of the n-th base-learner at time t for the 1 – a prediction
interval
Lt,a Interval score of the weighted average of the N methods at time t for the 1 – a
prediction interval
Computations were performed in R (R Core Team 2019) using the following packages: airGR (Coron et al. 2017, 2019),
2017), dplyr (Wickham et al. 2019b), foreach (Microsoft and Weston 2018), gdata
(Warnes et al. 2017), ggplot2 (Wickham 2016; Wickham et al. 2019a), grf (Tibshirani
et al. 2018), knitr (Xie 2014, 2015, 2019), quantreg (Koenker 2018), readr
(Wickham et al. 2018), reshape2 (Wickham 2007, 2017), rmarkdown (Allaire et al.
References
[1] Addor N, Newman AJ, Mizukami N, Clark MP (2017a) Catchment attributes for
large-sample studies. Boulder, CO: UCAR/NCAR.
https://doi.org/10.5065/D6G73C3Q
[2] Addor N, Newman AJ, Mizukami N, Clark MP (2017b) The CAMELS data set:
Catchment attributes and meteorology for large-sample studies. Hydrology and
Earth System Sciences 21:5293–5313. https://doi.org/10.5194/hess-21-5293-
2017
[3] Allaire JJ, Xie Y, McPherson J, Luraschi J, Ushey K, Atkins A, Wickham H, Cheng J,
Chang W, Iannone R (2019) rmarkdown: Dynamic documents for R. R package
version 1.12. https://CRAN.R-project.org/package=rmarkdown
[4] Alpaydin E (2014) Introduction to Machine Learning, 3rd Edition. The MIT Press,
Cambridge, Massachusetts
[5] Antal P, Fannes G, Timmerman D, Moreau Y, de Moor B (2003) Bayesian
applications of belief networks and multilayer perceptrons for ovarian tumor
classification with rejection. Artificial Intelligence in Medicine 29(1–2):39–60.
https://doi.org/10.1016/S0933-3657(03)00053-8
[6] Athey S, Tibshirani J, Wager S (2019) Generalized random forests. The Annals of
Statistics 47(2):1148–1178. https://doi.org/10.1214/18-AOS1709
[7] Baran S, Lerch S (2018) Combining predictive distributions for the statistical
post-processing of ensemble forecasts. International Journal of Forecasting
34(3):477–496. https://doi.org/10.1016/j.ijforecast.2018.01.005
[8] Beck HE, van Dijk AIJM, de Roo A, Dutra E, Fink G, Orth R, Schellekens J (2017)
Global evaluation of runoff from 10 state-of-the-art hydrological models.
Hydrology and Earth System Sciences 21(6):2881–2903.
https://doi.org/10.5194/hess-21-2881-2017
[9] Bentzien S, Friederichs P (2014) Decomposition and graphical portrayal of the
quantile score. Quarterly Journal of the Royal Meteorological Society
140(683):1924–1934. https://doi.org/10.1002/qj.2284
[10] Bhuiyan MAE, Nikolopoulos EI, Anagnostou EN, Quintana-Seguí P, Barella-Ortiz
A (2018) A nonparametric statistical technique for combining global
precipitation datasets: development and hydrological evaluation over the
Iberian Peninsula. Hydrology and Earth System Sciences 22:1371–1389.
https://doi.org/10.5194/hess-22-1371-2018
[11] Biau G, Scornet E (2016) A random forest guided tour. TEST 25(2):197–227.
https://doi.org/10.1007/s11749-016-0481-7
[12] Bock AR, Farmer WH, Hay LE (2018) Quantifying uncertainty in simulated
streamflow and runoff from a continental-scale monthly water balance model.
Advances in Water Resources 122:166–175.
https://doi.org/10.1016/j.advwatres.2018.10.005
[13] Bogner K, Pappenberger F (2011) Multiscale error analysis, correction, and
predictive uncertainty estimation in a flood forecasting system. Water
Resources Research 47(7):W07524. https://doi.org/10.1029/2010WR009137
[14] Bogner K, Pappenberger F, Cloke HL (2012) Technical Note: The normal
quantile transformation and its application in a flood forecasting system.
Hydrology and Earth System Sciences 16:1085–1094.
https://doi.org/10.5194/hess-16-1085-2012
[15] Bogner K, Liechti K, Zappa M (2016) Post-processing of stream flows in
Switzerland with an emphasis on low flows and floods. Water 8(4):115.
https://doi.org/10.3390/w8040115
[16] Bogner K, Liechti K, Zappa M (2017) Technical note: Combining quantile
forecasts and predictive distributions of streamflows. Hydrology and Earth
System Sciences 21:5493–5502. https://doi.org/10.5194/hess-21-5493-2017
[17] Bourgin F, Andréassian V, Perrin C, Oudin L (2015) Transferring global
uncertainty estimates from gauged to ungauged catchments. Hydrology and
Earth System Sciences 19:2535–2546. https://doi.org/10.5194/hess-19-2535-
2015
[18] Box GEP, Jenkins GM, Reinsel GC, Ljung GM (2015) Time Series
Analysis: Forecasting and Control, 5th Edition. John Wiley & Sons, Inc., Hoboken,
New Jersey
[19] Brath A, Rosso R (1993) Adaptive calibration of a conceptual model for flash
flood forecasting. Water Resources Research 29(8):2561–2572.
https://doi.org/10.1029/93WR00665
[20] Breiman L (1996a) Stacked regressions. Machine Learning 24(1):49–64.
https://doi.org/10.1007/BF00117832
[21] Breiman L (1996b) Bagging predictors. Machine Learning 24(2):123–140.
https://doi.org/10.1007/BF00058655
[22] Breiman L (2001) Random forests. Machine Learning 45(1):5–32.
https://doi.org/10.1023/A:1010933404324
[23] Bzdok D, Altman N, Krzywinski M (2018) Statistics versus machine learning.
Nature Methods 15:233–234. https://doi.org/10.1038/nmeth.4642
[24] Coron L, Thirel G, Delaigue O, Perrin C, Andréassian V (2017) The suite of
lumped GR hydrological models in an R package. Environmental Modelling and
Software 94:166–171. https://doi.org/10.1016/j.envsoft.2017.05.002
[25] Coron L, Delaigue O, Thirel G, Perrin C, Michel C (2019) airGR: Suite of GR
hydrological models for precipitation-runoff modelling. R package version
1.2.13.16. https://CRAN.R-project.org/package=airGR
[26] Das T, Bárdossy A, Zehe E, He Y (2008) Comparison of conceptual model
performance using different representations of spatial variability. Journal of
Hydrology 356(1–2):106–118. https://doi.org/10.1016/j.jhydrol.2008.04.008
[27] Díaz-Uriarte R, De Andres SA (2006) Gene selection and classification of
microarray data using random forest. BMC Bioinformatics 7:3.
https://doi.org/10.1186/1471-2105-7-3
[28] Dogulu N, López López P, Solomatine DP, Weerts AH, Shrestha DL (2015)
Estimation of predictive hydrologic uncertainty using the quantile regression
and UNEEC methods and their comparison on contrasting catchments.
Hydrology and Earth System Sciences 19:3181–3201.
https://doi.org/10.5194/hess-19-3181-2015
[29] Dowle M, Srinivasan A (2019) data.table: Extension of 'data.frame'. R package
version 1.12.2. https://CRAN.R-project.org/package=data.table
[30] Dunsmore IR (1968) A Bayesian approach to calibration. Journal of the Royal
Statistical Society. Series B (Methodological) 30(2):396–405.
[31] Edijatno, Nascimento NO, Yang X, Makhlouf Z, Michel C (1999) GR3J: A daily
watershed model with three free parameters. Hydrological Sciences Journal
44(2):263–277. https://doi.org/10.1080/02626669909492221
[32] Evensen G (1994) Sequential data assimilation with a nonlinear
quasi-geostrophic model using Monte Carlo methods to forecast error statistics.
Journal of Geophysical Research 99(C5):10143–10162.
https://doi.org/10.1029/94JC00572
[33] Evin G, Thyer M, Kavetski D, McInerney D, Kuczera G (2014) Comparison of joint
versus postprocessor approaches for hydrological uncertainty estimation
accounting for error autocorrelation and heteroscedasticity. Water Resources
Research 50(3):2350–2375. https://doi.org/10.1002/2013WR014185
[34] Friederichs P, Hense A (2007) Statistical downscaling of extreme precipitation
events using censored quantile regression. Monthly Weather Review 135:2365–
2378. https://doi.org/10.1175/MWR3403.1
[35] Friedman JH (2001) Greedy function approximation: A gradient boosting
machine. The Annals of Statistics 29(5):1189–1232.
https://doi.org/10.1214/aos/1013203451
[36] Gagolewski M (2019) stringi: Character string processing facilities. R package
version 1.4.3. https://CRAN.R-project.org/package=stringi
[37] Gneiting T, Raftery AE (2007) Strictly proper scoring rules, prediction, and
estimation. Journal of the American Statistical Association 102(477):359–378.
https://doi.org/10.1198/016214506000001437
[38] Gneiting T, Ranjan R (2013) Combining predictive distributions. Electronic
Journal of Statistics 7:1747–1782. https://doi.org/10.1214/13-EJS823
[39] Gneiting T, Raftery AE, Westveld AH, Goldman T (2005) Calibrated probabilistic
forecasting using ensemble model output statistics and minimum CRPS
Estimation. Monthly Weather Review 133:1098–1118.
https://doi.org/10.1175/MWR2904.1
[40] Hamill TM, Wilks DS (1995) A Probabilistic forecast contest and the difficulty in
assessing short-range forecast uncertainty. Weather and Forecasting 10:620–
631. https://doi.org/10.1175/1520-0434(1995)010<0620:APFCAT>2.0.CO;2
[41] Hannan EJ, Dunsmuir WTM, Deistler M (1980) Estimation of vector ARMAX
models. Journal of Multivariate Analysis 10(3):275–295.
https://doi.org/10.1016/0047-259X(80)90050-0
[42] Hastie T, Tibshirani R, Friedman J (2009) The Elements of Statistical Learning.
Springer-Verlag New York. https://doi.org/10.1007/978-0-387-84858-7
[43] Hemri S (2018) Chapter 8 - Applications of Postprocessing for Hydrological
Forecasts. In: Vannitsem S, Wilks DS, Messner JW (eds) Statistical
Postprocessing of Ensemble Forecasts. Elsevier, pp 219–240.
https://doi.org/10.1016/B978-0-12-812372-0.00008-X
[44] Hernández-López MR, Francés F (2017) Bayesian joint inference of hydrological
and generalized error models with the enforcement of Total Laws. Hydrology
and Earth System Sciences Discussions. https://doi.org/10.5194/hess-2017-9
[45] Hong T, Pinson P, Fan S, Zareipour H, Troccoli A, Hyndman RJ (2016)
Probabilistic energy forecasting: Global Energy Forecasting Competition 2014
and beyond. International Journal of Forecasting 32(3):896–913.
https://doi.org/10.1016/j.ijforecast.2016.02.001
[46] James G, Witten D, Hastie T, Tibshirani R (2013) An Introduction to Statistical
Learning. Springer-Verlag New York. https://doi.org/10.1007/978-1-4614-
7138-7
[47] Kaleris V, Langousis A (2017) Comparison of two rainfall–runoff models: effects
of conceptualization on water budget components. Hydrological Sciences
Journal 62(5):729–748. https://doi.org/10.1080/02626667.2016.1250899
[48] Kalman RE (1960) A new approach to linear filtering and prediction problems.
Journal of Basic Engineering 82(1):35–45. https://doi.org/10.1115/1.3662552
[49] Kavetski D, Franks SW, Kuczera G (2002) Confronting Input Uncertainty in
Environmental Modelling. In: Duan Q, Gupta HV, Sorooshian S, Rousseau AN,
Turcotte R (eds) Calibration of Watershed Models. AGU, pp 49–68.
https://doi.org/10.1029/WS006p0049
[50] Klemeš V (1986) Operational testing of hydrological simulation models.
Hydrological Sciences Journal 31(1):13–24.
https://doi.org/10.1080/02626668609491024
[51] Koenker RW (2005) Quantile regression. Cambridge University Press,
Cambridge, UK
[52] Koenker RW (2017) Quantile regression: 40 years on. Annual Review of
Economics 9(1):155–176. https://doi.org/10.1146/annurev-economics-
063016-103651
[53] Koenker RW (2018) quantreg: Quantile regression. R package version 5.38.
https://CRAN.R-project.org/package=quantreg
[54] Koenker RW, Bassett Jr G (1978) Regression quantiles. Econometrica 46(1):33–
50. https://doi.org/10.2307/1913643
[55] Koenker RW, D'Orey V (1987) Computing regression quantiles. Journal of the
Royal Statistical Society: Series C (Applied Statistics) 36(3):383–393.
https://doi.org/10.2307/2347802
[56] Koenker RW, D'Orey V (1994) A remark on algorithm AS 229: Computing dual
regression quantiles and regression rank scores. Journal of the Royal Statistical
Society: Series C (Applied Statistics) 43(2):410–414.
https://doi.org/10.2307/2986030
[57] Koenker RW, Machado JAF (1999) Goodness of fit and related inference
processes for quantile regression. Journal of the American Statistical
Association 94(448):1296–1310.
https://doi.org/10.1080/01621459.1999.10473882
[58] Koutsoyiannis D, Montanari A (2015) Negligent killing of scientific concepts: the
stationarity case. Hydrological Sciences Journal 60(7–8):1174–1183.
https://doi.org/10.1080/02626667.2014.959959
[59] Krzysztofowicz R (1987) Markovian forecast processes. Journal of the American
Statistical Association 82(397):31–37.
https://doi.org/10.1080/01621459.1987.10478387
[60] Krzysztofowicz R (1997) Transformation and normalization of variates with
specified distributions. Journal of Hydrology 197(1–4):286–292.
https://doi.org/10.1016/S0022-1694(96)03276-3
[61] Krzysztofowicz R (1999) Bayesian theory of probabilistic forecasting via
deterministic hydrologic model. Water Resources Research 35(9):2739–2750.
https://doi.org/10.1029/1999WR900099
[62] Krzysztofowicz R (2001) The case for probabilistic forecasting in hydrology.
Journal of Hydrology 249(1–4):2–9. https://doi.org/10.1016/S0022-
1694(01)00420-6
[63] Krzysztofowicz R (2002) Bayesian system for probabilistic river stage
forecasting. Journal of Hydrology 268:16–40. https://doi.org/10.1016/S0022-
1694(02)00106-3
[64] Krzysztofowicz R, Kelly KS (2000) Hydrologic uncertainty processor for
probabilistic river stage forecasting. Water Resources Research 36:3265–3277.
https://doi.org/10.1029/2000WR900108
[65] Kuczera G, Kavetski D, Franks S, Thyer M (2006) Towards a Bayesian total error
analysis of conceptual rainfall-runoff models: Characterising model error using
storm-dependent parameters. Journal of Hydrology 331(1–2):161–177.
https://doi.org/10.1016/j.jhydrol.2006.05.010
[66] Langousis A, Mamalakis A, Puliga M, Deida R (2016) Threshold detection for the
generalized Pareto distribution: Review of representative methods and
application to the NOAA NCDC daily rainfall database. Water Resources
Research 52(4):2659–2681. https://doi.org/10.1002/2015WR018502
[67] Li W, Duan Q, Miao C, Ye A, Gong W, Di Z (2017) A review on statistical
postprocessing methods for hydrometeorological ensemble forecasting. Wiley
Interdisciplinary Reviews: Water 4(6):e1246.
https://doi.org/10.1002/wat2.1246
[68] Lichtendahl Jr KC, Grushka-Cockayne Y, Winkler RL (2013) Is it better to
average probabilities or quantiles?. Management Science 59(7):1594–1611.
https://doi.org/10.1287/mnsc.1120.1667
[69] Lidén R, Harlin J (2000) Analysis of conceptual rainfall–runoff modelling
performance in different climates. Journal of Hydrology 238(3–4):231–247.
https://doi.org/10.1016/S0022-1694(00)00330-9
[70] López López P, Verkade JS, Weerts AH, Solomatine DP (2014) Alternative
configurations of quantile regression for estimating predictive uncertainty in
water level forecasts for the upper Severn River: a comparison. Hydrology and
Earth System Sciences 18:3411–3428. https://doi.org/10.5194/hess-18-3411-
2014
[71] Mayr A, Binder H, Gefeller O, Schmid M (2014) The evolution of boosting
algorithms. Methods of Information in Medicine 53(06):419–427.
https://doi.org/10.3414/ME13-01-0122
[72] Meinshausen N (2006) Quantile regression forests. Journal of Machine Learning
Research 7:983–999
[73] Messner JW (2018) Chapter 11 - Ensemble Postprocessing With R. In:
Vannitsem S, Wilks DS, Messner JW (eds) Statistical Postprocessing of Ensemble
Forecasts. Elsevier, pp 291–329. https://doi.org/10.1016/B978-0-12-812372-
0.00011-X
[74] Michel C (1991) Hydrologie appliquée aux petits bassins ruraux. Cemagref,
Antony, France
[75] Microsoft, Weston S (2017) foreach: Provides foreach looping construct for R. R
package version 1.4.4. https://CRAN.R-project.org/package=foreach
[76] Microsoft Corporation, Weston S (2018) doParallel: Foreach parallel adaptor for
the 'parallel' package. R package version 1.0.14. https://CRAN.R-
project.org/package=doParallel
[77] Min C, Zellner A (1993) Bayesian and non-Bayesian methods for combining
models and forecasts with applications to forecasting international growth
rates. Journal of Econometrics 56(1–2):89–118. https://doi.org/10.1016/0304-
4076(93)90102-B
[78] Montanari A (2011) 2.17 - Uncertainty of Hydrological Predictions. In: Wilderer
P (ed) Treatise on Water Science. Elsevier, pp 459–478.
https://doi.org/10.1016/B978-0-444-53199-5.00045-2
[79] Montanari A, Brath A (2004) A stochastic approach for assessing the
uncertainty of rainfall-runoff simulations. Water Resources Research
40(1):W01106. https://doi.org/10.1029/2003WR002540
[80] Montanari A, Grossi G (2008) Estimating the uncertainty of hydrological
forecasts: A statistical approach. Water Resources Research 44(12):W00B08.
https://doi.org/10.1029/2008WR006897
[81] Montanari A, Koutsoyiannis D (2012) A blueprint for process-based modeling of
uncertain hydrological systems. Water Resources Research 48(9):W09555.
https://doi.org/10.1029/2011WR011412
[82] Mouelhi S, Michel C, Perrin C, Andréassian V (2006a) Stepwise development of a
two-parameter monthly water balance model. Journal of Hydrology 318(1–
4):200–214. https://doi.org/10.1016/j.jhydrol.2005.06.014
[83] Mouelhi S, Michel C, Perrin C, Andréassian V (2006b) Linking stream flow to
rainfall at the annual time step: the Manabe bucket model revisited. Journal of
Hydrology 328(1–2):283–296. https://doi.org/10.1016/j.jhydrol.2005.12.022
[84] Nash JE, Sutcliffe JV (1970) River flow forecasting through conceptual models
part I — A discussion of principles. Journal of Hydrology 10(3):282–290.
https://doi.org/10.1016/0022-1694(70)90255-6
[85] Natekin A, Knoll A (2013) Gradient boosting machines, a tutorial. Frontiers in
Neurorobotics 7:21. https://doi.org/10.3389/fnbot.2013.00021
[86] Newman AJ, Sampson K, Clark MP, Bock A, Viger RJ, Blodgett D (2014) A large-
sample watershed-scale hydrometeorological dataset for the contiguous USA.
Boulder, CO: UCAR/NCAR. https://doi.org/10.5065/D6MW2F4D
[87] Newman AJ, Clark MP, Sampson K, Wood A, Hay LE, Bock A, Viger RJ, Blodgett D,
Brekke L, Arnold JR, Hopson T, Duan Q (2015) Development of a large-sample
watershed-scale hydrometeorological data set for the contiguous USA: data set
characteristics and assessment of regional variability in hydrologic model
performance. Hydrology and Earth System Sciences 19:209–223.
https://doi.org/10.5194/hess-19-209-2015
[88] Newman AJ, Mizukami N, Clark MP, Wood AW, Nijssen B, Nearing G (2017)
Benchmarking of a physically based hydrologic model. Journal of
Hydrometeorology 18:2215–2225. https://doi.org/10.1175/JHM-D-16-0284.1
[89] Nikolopoulos EI, Destro E, Bhuiyan MAE, Borga M, Anagnostou EN (2018)
Evaluation of predictive models for post-fire debris flow occurrence in the
western United States. Natural Hazards and Earth System Sciences 18:2331–
2343. https://doi.org/10.5194/nhess-18-2331-2018
[90] Oshiro TM, Perez PS, Baranauskas JA (2012) How many trees in a random
forest?. In: Perner P (ed) Machine Learning and Data Mining in Pattern
Recognition (Lecture Notes in Computer Science). Springer-Verlag Berlin
Heidelberg, IBaI, Leipzig, Germany, 2012; Volume 7376, pp 154–168.
https://doi.org/10.1007/978-3-642-31537-4
[91] Ouali D, Chebana F, Ouarda TBMJ (2016) Quantile regression in regional
frequency analysis: A better exploitation of the available information. Journal of
Hydrometeorology 17:1869–1883. https://doi.org/10.1175/JHM-D-15-0187.1
[92] Oudin L, Hervieu F, Michel C, Perrin C, Andréassian V, Anctil F, Loumagne C
(2005) Which potential evapotranspiration input for a lumped rainfall–runoff
model?: Part 2—Towards a simple and efficient potential evapotranspiration
model for rainfall–runoff modelling. Journal of Hydrology 303(1–4):290–306.
https://doi.org/10.1016/j.jhydrol.2004.08.026
[93] Pagano TC, Shrestha DL, Wang QJ, Robertson D, Hapuarachchi P (2013)
Ensemble dressing for hydrological applications. Hydrological Processes
27(1):106–116. https://doi.org/10.1002/hyp.9313
[94] Papacharalampous G, Tyralis H (2018) Evaluation of random forests and
Prophet for daily streamflow forecasting. Advances in Geosciences 45:201–208.
https://doi.org/10.5194/adgeo-45-201-2018
[95] Papacharalampous G, Tyralis H, Koutsoyiannis D (2018a) One-step ahead
forecasting of geophysical processes within a purely statistical framework.
Geoscience Letters 5(12). https://doi.org/10.1186/s40562-018-0111-1
[96] Papacharalampous G, Tyralis H, Koutsoyiannis D (2018b) Predictability of
monthly temperature and precipitation using automatic time series forecasting
methods. Acta Geophysica 66(4):807–831. https://doi.org/10.1007/s11600-
018-0120-7
[97] Papacharalampous G, Tyralis H, Koutsoyiannis D (2018c) Univariate time series
forecasting of temperature and precipitation with a focus on machine learning
algorithms: A multiple-case study from Greece. Water Resources Management
32(15):5207–5239. https://doi.org/10.1007/s11269-018-2155-6
[98] Papacharalampous G, Tyralis H, Koutsoyiannis D (2019a) Comparison of
stochastic and machine learning methods for multi-step ahead forecasting of
hydrological processes. Stochastic Environmental Research and Risk
Assessment 33(2):481–514. https://doi.org/10.1007/s00477-018-1638-6
[99] Papacharalampous G, Koutsoyiannis D, Montanari A (2019b) Quantification of
predictive uncertainty in hydrological modelling by harnessing the wisdom of
the crowd: Methodology development and investigation using toy models.
https://doi.org/10.13140/RG.2.2.32868.22401
[100] Papacharalampous G, Tyralis H, Koutsoyiannis D, Montanari A (2019c)
Quantification of predictive uncertainty in hydrological modelling by harnessing
the wisdom of the crowd: A large–sample experiment at monthly timescale.
https://doi.org/10.13140/RG.2.2.16091.00801
[101] Perrin C, Michel C, Andréassian V (2001) Does a large number of parameters
enhance model performance? Comparative assessment of common catchment
model structures on 429 catchments. Journal of Hydrology 242(3–4):275–301.
https://doi.org/10.1016/S0022-1694(00)00393-0
[102] Perrin C, Michel C, Andréassian V (2003) Improvement of a parsimonious model
for streamflow simulation. Journal of Hydrology 279(1–4):275–289.
https://doi.org/10.1016/S0022-1694(03)00225-7
[103] Peterson RA (2018) bestNormalize: Normalizing transformation functions. R
package version 1.3.0. https://CRAN.R-project.org/package=bestNormalize
[104] Probst P, Boulesteix AL (2018) To tune or not to tune the number of trees in
random forest. Journal of Machine Learning Research 18(181):1–18
[105] R Core Team (2019) R: A language and environment for statistical computing. R
Foundation for Statistical Computing, Vienna, Austria. https://www.R-
project.org/
[106] Raftery AE, Madigan D, Hoeting JA (1997) Bayesian model averaging for linear
regression models. Journal of the American Statistical Association 92(437):179–
191. https://doi.org/10.1080/01621459.1997.10473615
[107] Raftery AE, Gneiting T, Balabdaoui F, Polakowski M (2005) Using Bayesian
model averaging to calibrate forecast ensembles. Monthly Weather Review
133:1155–1174. https://doi.org/10.1175/MWR2906.1
[108] Ranjan R, Gneiting T (2010) Combining probability forecasts. Journal of the
Royal Statistical Society: Series B (Statistical Methodology) 72(1):71–91.
https://doi.org/10.1111/j.1467-9868.2009.00726.x
[109] Reinsel G (1979) Maximum likelihood estimation of stochastic linear difference
equations with autoregressive moving average errors. Econometrica
47(1):129–151. https://doi.org/10.2307/1912351
[110] Rigby RA, Stasinopoulos DM (2005) Generalized additive models for location,
scale and shape. Journal of the Royal Statistical Society: Series C (Applied
Statistics) 54(3):507–554. https://doi.org/10.1111/j.1467-9876.2005.00510.x
[111] Sagi O, Rokach L (2018) Ensemble learning: A survey. Wiley Interdisciplinary
Reviews: Data Mining and Knowledge Discovery 8(4):e1249.
https://doi.org/10.1002/widm.1249
[112] Scornet E, Biau G, Vert JP (2015) Consistency of random forests. The Annals of
Statistics 43(4):1716–1741. https://doi.org/10.1214/15-AOS1321
[113] Seo DJ, Herr HD, Schaake JC (2006) A statistical post-processor for accounting of
hydrologic uncertainty in short-range ensemble streamflow prediction.
Hydrology and Earth System Sciences Discussions 3:1987–2035.
https://doi.org/10.5194/hessd-3-1987-2006
[114] Shastri H, Ghosh S, Karmakar S (2017) Improving global forecast system of
extreme precipitation events with regional statistical model: Application of
quantile-based probabilistic forecasts. Journal of Geophysical Research:
Atmospheres 122(3):1617–1634. https://doi.org/10.1002/2016JD025489
[115] Smyth P, Wolpert D (1999) Linearly combining density estimators via stacking.
Machine Learning 36(1–2):59–83. https://doi.org/10.1023/A:1007511322260
[116] Solomatine DP, Wagener T (2011) 2.16 - Hydrological Modeling. In: Wilderer P
(ed) Treatise on Water Science. Elsevier, pp 435–457.
https://doi.org/10.1016/B978-0-444-53199-5.00044-0
[117] Taillardat M, Mestre O, Zamo M, Naveau P (2016) Calibrated ensemble forecasts
using quantile regression forests and ensemble model output statistics. Monthly
Weather Review 144:2375–2393. https://doi.org/10.1175/MWR-D-15-0260.1
[118] Taylor JW (2000) A quantile regression neural network approach to estimating
the conditional density of multiperiod returns. Journal of Forecasting
19(4):299–311. https://doi.org/10.1002/1099-131X(200007)19:4<299::AID-
FOR775>3.0.CO;2-V
[119] Thornton PE, Thornton MM, Mayer BW, Wilhelmi N, Wei Y, Devarakonda R,
Cook RB (2014) Daymet: Daily surface weather data on a 1-km grid for North
America, version 2. ORNL DAAC, Oak Ridge, Tennessee, USA. Date accessed:
2016/01/20. https://doi.org/10.3334/ORNLDAAC/1219
[120] Tibshirani J, Athey S, Wager S (2018) grf: Generalized random forests (beta). R
package version 0.10.2. https://CRAN.R-project.org/package=grf
[121] Todini E (2007) Hydrological catchment modelling: Past, present and future.
Hydrology and Earth System Sciences 11:468–482.
https://doi.org/10.5194/hess-11-468-2007
[122] Toth E, Montanari A, Brath A (1999) Real-time flood forecasting via combined
use of conceptual and stochastic models. Physics and Chemistry of the Earth,
Part B: Hydrology, Oceans and Atmosphere 24(7):793–798.
https://doi.org/10.1016/S1464-1909(99)00082-9
[123] Trapero JR, Cardós M, Kourentzes N (2019) Quantile forecast optimal
combination to enhance safety stock estimation. International Journal of
Forecasting 35(1):239–250. https://doi.org/10.1016/j.ijforecast.2018.05.009
[124] Tyralis H, Koutsoyiannis D (2014) A Bayesian statistical model for deriving the
predictive distribution of hydroclimatic variables. Climate Dynamics 42(11–
12):2867–2883. https://doi.org/10.1007/s00382-013-1804-y
[125] Tyralis H, Koutsoyiannis D (2017) On the prediction of persistent processes
using the output of deterministic models. Hydrological Sciences Journal
62(13):2083–2102. https://doi.org/10.1080/02626667.2017.1361535
[126] Tyralis H, Papacharalampous G (2017) Variable selection in time series
forecasting using random forests. Algorithms 10(4):114.
https://doi.org/10.3390/a10040114
[127] Tyralis H, Papacharalampous G (2018) Large-scale assessment of Prophet for
multi-step ahead forecasting of monthly streamflow. Advances in Geosciences
45:147–153. https://doi.org/10.5194/adgeo-45-147-2018
[128] Tyralis H, Dimitriadis P, Koutsoyiannis D, O'Connell PE, Tzouka K, Iliopoulou T
(2018) On the long-range dependence properties of annual precipitation using a
global network of instrumental measurements. Advances in Water Resources
111:301–318. https://doi.org/10.1016/j.advwatres.2017.11.010
[129] Tyralis H, Papacharalampous G, Langousis A (2019a) A brief review of random
forests for water scientists and practitioners and their recent history in water
resources. Water 11(5):910. https://doi.org/10.3390/w11050910
[130] Tyralis H, Papacharalampous G, Tantanee S (2019b) How to explain and predict
the shape parameter of the generalized extreme value distribution of
streamflow extremes using a big dataset. Journal of Hydrology 574:628–645.
https://doi.org/10.1016/j.jhydrol.2019.04.070
[131] Verikas A, Gelzinis A, Bacauskiene M (2011) Mining data with random forests: A
survey and results of new tests. Pattern Recognition 44(2):330–349.
https://doi.org/10.1016/j.patcog.2010.08.011
[132] Vrugt JA, Robinson BA (2007) Treatment of uncertainty using ensemble
methods: Comparison of sequential data assimilation and Bayesian model
averaging. Water Resources Research 43(1):W01411.
https://doi.org/10.1029/2005WR004838
[133] Waldmann E (2018) Quantile regression: A short story on how and why.
Statistical Modelling 18(3–4):203–218.
https://doi.org/10.1177/1471082X18759142
[134] Wang Y, Zhang N, Tan Y, Hong T, Kirschen DS, Kang C (2019) Combining
probabilistic load forecasts. IEEE Transactions on Smart Grid 10(4):3664–3674.
https://doi.org/10.1109/TSG.2018.2833869
[135] Warnes GR, Bolker B, Gorjanc G, Grothendieck G, Korosec A, Lumley T,
MacQueen D, Magnusson A, Rogers J (2017) gdata: Various R programming
tools for data manipulation. R package version 2.18.0. https://CRAN.R-
project.org/package=gdata
[136] Weerts AH, Winsemius HC, Verkade JS (2011) Estimation of predictive
hydrological uncertainty using quantile regression: Examples from the national
flood forecasting system (England and Wales). Hydrology and Earth System
Sciences 15:255–265. https://doi.org/10.5194/hess-15-255-2011
[137] Weijs SV, Schoups G, Van de Giesen N (2010) Why hydrological predictions
should be evaluated using information theory. Hydrology and Earth System
Sciences 14:2545–2558. https://doi.org/10.5194/hess-14-2545-2010
[138] Wickham H (2007) Reshaping data with the reshape package. Journal of
Statistical Software 21(12). https://doi.org/10.18637/jss.v021.i12
[139] Wickham H (2016) ggplot2. Springer-Verlag New York.
https://doi.org/10.1007/978-0-387-98141-3
[140] Wickham H (2017) reshape2: Flexibly reshape data: A reboot of the reshape
package. R package version 1.4.3. https://CRAN.R-
project.org/package=reshape2
[141] Wickham H (2019) stringr: Simple, consistent wrappers for common string
operations. R package version 1.4.0. https://CRAN.R-
project.org/package=stringr
[142] Wickham H, Hester J, Francois R (2018) readr: Read rectangular text data. R
package version 1.3.1. https://CRAN.R-project.org/package=readr
[143] Wickham H, Chang W, Henry L, Pedersen TL, Takahashi K, Wilke C, Woo K
(2019a) ggplot2: Create elegant data visualisations using the grammar of
graphics. R package version 3.1.1. https://CRAN.R-project.org/package=ggplot2
[144] Wickham H, François R, Henry L, Müller K (2019b) dplyr: A grammar of data
manipulation. R package version 0.8.0.1. https://CRAN.R-
project.org/package=dplyr
[145] Wickham H, Hester J, Chang W (2019c) devtools: Tools to make developing R
packages easier. R package version 2.0.2. https://CRAN.R-
project.org/package=devtools
[146] Winkler RL (1972) A decision-theoretic approach to interval estimation. Journal
of the American Statistical Association 67(337):187–191.
https://doi.org/10.1080/01621459.1972.10481224
[147] Wolpert DH (1992) Stacked generalization. Neural Networks 5(2):241–259.
https://doi.org/10.1016/S0893-6080(05)80023-1
[148] Xie Y (2014) knitr: A comprehensive tool for reproducible research in R. In:
Stodden V, Leisch F, Peng RD (eds) Implementing Reproducible Computational
Research. Chapman and Hall/CRC
[149] Xie Y (2015) Dynamic Documents with R and knitr, 2nd edn. Chapman and
Hall/CRC
[150] Xie Y (2019) knitr: A general-purpose package for dynamic report generation in
R. R package version 1.22. https://CRAN.R-project.org/package=knitr
[151] Xu L, Chen N, Zhang X, Chen Z (2018) An evaluation of statistical, NMME and
hybrid models for drought prediction in China. Journal of Hydrology 566:235–
249. https://doi.org/10.1016/j.jhydrol.2018.09.020
[152] Yan J, Liao GY, Gebremichael M, Shedd R, Vallee DR (2014) Characterizing the
uncertainty in river stage forecasts conditional on point forecast values. Water
Resources Research 48(12):W12509. https://doi.org/10.1029/2012WR011818
[153] Yao Y, Vehtari A, Simpson D, Gelman A (2018) Using stacking to average
Bayesian predictive distributions. Bayesian Analysis 13(3):917–1003.
https://doi.org/10.1214/17-BA1091
[154] Ye A, Duan Q, Yuan X, Wood EF, Schaake J (2014) Hydrologic post-processing of
MOPEX streamflow simulations. Journal of Hydrology 508:147–156.
https://doi.org/10.1016/j.jhydrol.2013.10.055
[155] Yu B, Xu Z (2008) A comparative study for content-based dynamic spam
classification using four machine learning algorithms. Knowledge-Based
Systems 21(4):355–362. https://doi.org/10.1016/j.knosys.2008.01.001
[156] Zhao L, Duan Q, Schaake J, Ye A, Xia J (2011) A hydrologic post-processor for
ensemble streamflow predictions. Advances in Geosciences 29:51–59.
https://doi.org/10.5194/adgeo-29-51-2011
Conflicts of interest: The authors declare no conflict of interest.
Highlights