PII: S0022-1694(19)30677-8
DOI: https://doi.org/10.1016/j.jhydrol.2019.123957
Article Number: 123957
Reference: HYDROL 123957
Please cite this article as: Tyralis, H., Papacharalampous, G., Burnetas, A., Langousis, A., Hydrological post-
processing using stacked generalization of quantile regression algorithms: Large-scale application over CONUS,
Journal of Hydrology (2019), doi: https://doi.org/10.1016/j.jhydrol.2019.123957
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers
we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and
review of the resulting proof before it is published in its final form. Please note that during the production process
errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Hydrological post-processing using stacked generalization of quantile
regression algorithms: Large-scale application over CONUS
(temperature, precipitation etc.; see e.g. Lidén and Harlin 2000, Mouelhi et al. 2006a, b,
Das et al. 2008, Kaleris and Langousis 2017). In this context, hydrological models can be
classified into three broad categories, i.e. physically based, conceptual, and data-driven (see e.g. Solomatine and Wagener 2011). The output of the physically based and conceptual models consists of point predictions of hydrologic quantities, which do not allow for
direct quantification of predictive uncertainties. To account for the latter, within the
general framework of probabilistic prediction (see e.g. Krzysztofowicz and Kelly 2000,
Krzysztofowicz 2001, 2002, Kavetski et al. 2002, Montanari and Brath 2004, Kuczera et
al. 2006, Todini 2007, Montanari and Grossi 2008, Weijs et al. 2010, Montanari and
2017), one needs to estimate the probability distribution function (PDF) of the
predictand variable (or the joint probability distribution function of all predictand
2017).
processing, the reader is referred to Li et al. (2017). Examples of relevant algorithms
(a) Quantile regression (see e.g. Koenker and Bassett Jr 1978, and Koenker 2005 on
frequency analysis in hydrology, and Weerts et al. 2011, López López et al. 2014, Dogulu
(b) Quantile regression neural networks (QRNN, where Artificial Neural Networks
are used to quantify the relationship between predictor variables and conditional
quantiles of dependent variables, see e.g. Taylor 2000, and Bogner et al. 2016 for an
Within the broader class of regression schemes, one can also consider:
(a) Autoregressive models with exogenous variables (ARX, see e.g. Reinsel 1979,
Hannan et al. 1980, Box et al. 2015, and Seo et al. 2006 for an application).
(b) Vector autoregressive models with exogenous variables (VARX, see e.g. Hannan et
al. 1980 on the methodological framework, and Bogner and Pappenberger 2011 for an
application).
(c) Use of ensemble Kalman filtering techniques (see e.g. Kalman 1960, Evensen 1994,
dependent variables are modelled using regression algorithms, see e.g. Rigby and
Stasinopoulos 2005, and Yan et al. 2014 for an application on river storage forecasts).
2014, p. 487) have started gaining prominence. These include Bayesian Model Averaging
(BMA, see e.g. Min and Zellner 1993, Raftery et al. 1997, 2005), non-homogenous
Gaussian regression (NGR, see e.g. Gneiting et al. 2005) and the beta-transformed linear
pool (BLP, see e.g. Ranjan and Gneiting 2010, Gneiting and Ranjan 2013), among others;
see e.g. the reviews in Bogner et al. (2017), Baran and Lerch (2018) and Wang et al.
(2019).
Most regression models belong to the families of Statistical Learning (SL, see e.g.
Hastie et al. 2009; James et al. 2013) or Machine Learning (ML) algorithms, with the
distinction between the two terms being primarily a matter of scientific debate (see e.g.
Bzdok et al. 2018). For brevity, in what follows, we use the term machine learning (ML)
for the algorithms and general methodological framework, and skip the alternative term.
Machine learning algorithms belong to the class of nonparametric methods and, thus, do not provide explicit expressions for the PDFs of the obtained forecasts. The latter need to
model used, to be properly combined using methods such as BMA, BLP, NGR etc. (see
above).
expressions for the PDFs of the base-learners, Wang et al. (2019) proposed the
forecasts and predict electricity demand. CQRA is based on the minimization of the
quantile score (QS, see e.g. Koenker and Machado 1999, Friederichs and Hense 2007,
Bentzien and Friederichs 2014, referred to as pinball loss in Wang et al. 2019) over all
targeted quantiles and forecast horizons, using linear programming to estimate optimal
weights for all individual probabilistic forecasts. The method is capable of combining
forms (e.g. as in Tyralis and Koutsoyiannis 2014). Note that QS has been consistently
predictand variables (Bogner et al. 2016, 2017), as well as the quality (reliability,
probabilistic hydrological forecasts in the absence of explicit expressions for the PDFs of
the method is based on the minimization of the interval score (IS, also referred to as
Winkler score, Gneiting and Raftery 2007) and combines base-learners using stacked
generalization (stacking, Wolpert 1992), following the CQRA method. Stacking focuses
on the performance of the combination of the algorithms, in contrast to Bayesian Model Averaging, which is widely used in hydrology but may produce largely inaccurate results, as shown by Yao et al. (2018). Furthermore, it has been suggested that combining quantile forecasts (as e.g. in the CQRA method) should be preferred to
et al. 2013).
We introduce the method with the aim to improve probabilistic predictions when
Attributes and MEteorology for Large-sample Studies) dataset. Two experiments are
conducted in the 511 basins, i.e. (a) one-step-ahead prediction (see e.g. Evin et al. 2014)
scale (see e.g. the review in Beck et al. 2017) and, therefore, it can effectively serve for
validation of the introduced method. Large-scale assessments are increasingly used in
hydrological modelling and forecasting (see e.g. Perrin et al. 2001, Mouelhi et al. 2006a,
b, Bourgin et al. 2015, Langousis et al. 2016, Beck et al. 2017, Tyralis and
2019a, c, Tyralis et al. 2018a, b, Xu et al. 2018), as their results are more general than
those of case studies, while only a few large-scale studies currently appear in the literature
In Sections 2 and 3.3, we introduce the proposed general framework and its technical
hydrological post-processing for 511 basins (as outlined above), and illustrate its
improved performance relative to the base-learners used. Sections 4.1 and 5 discuss the
obtained results, as well as general concepts regarding the application of the method.
2. Methods
The definitions and nomenclature for the variables, sets, and methods used hereafter,
introduced by Wolpert (1992), where the base-learners are combined using another
learner, usually referred to as the combiner learner (see e.g. Alpaydin 2014, p. 504). A
note to be made here is that ensemble learning of ML algorithms should not be confused
with the general concept of ensemble forecasting in hydrology, which implies that the
estimation variance of hydrological quantities can be obtained from the spread of the
ensemble member forecasts originating from different hydrological models (see e.g.
Gneiting et al. 2005). In the context of probabilistic forecasts, ensemble learning stands
for the use of multiple ML algorithms to obtain individual probabilistic forecasts, and
intervals. For example, the CQRA method (Wang et al. 2019) relies on weighted
The base-learners used herein are quantile regression (QR) and quantile regression
forests (QRF, Meinshausen 2006); see Section 2.5 for details. QRF is based on random
forests (RF, Breiman 2001), and it has been used for hydrometeorological post-
e.g. Bhuiyan et al. 2018). Here QRF is introduced in the context of hydrological post-
processing. For combiner learner we use the weighted sum of the predictive quantiles,
processing streamflow simulations. The latter are obtained via the GR4J (Génie Rural à 4 paramètres Journalier) hydrological model of Perrin et al. (2003). Other hydrological models can also be used; however, our focus here is on the
(a) Experiment 1: One-step ahead predictions (as e.g. in Evin et al. 2014), where at
each time step of the prediction period, the base-learners use observed streamflow
information from the previous day, and the same-day hydrological model output.
(b) Experiment 2: Predictions where, at each time step, the base-learners use hydrological model outputs for the current and two previous days.
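In code form, the predictor vectors of the two experiments could be assembled as follows (an illustrative Python sketch; the paper's computations are in R, and the function and variable names here are ours). The definitions match the predictor variables xt = (yt – 1, vt) and xt = (vt, vt – 1, vt – 2) given later in the text:

```python
def predictors(y, v, t, experiment):
    """Predictor vector x_t at (0-based) time step t, where y holds observed
    and v simulated daily streamflows.
      Experiment 1: x_t = (y_{t-1}, v_t)        -- previous-day observation
                                                    plus same-day simulation
      Experiment 2: x_t = (v_t, v_{t-1}, v_{t-2})  -- simulations only"""
    if experiment == 1:
        return (y[t - 1], v[t])
    return (v[t], v[t - 1], v[t - 2])
```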
variables for the base-learners (as e.g. Ye et al. 2014), and/or used to obtain
We run the calibrated hydrological model in simulation mode; i.e. we obtain the
(Klemeš 1986; see e.g. Vrugt and Robinson 2007, Montanari and Grossi 2008, Zhao et al.
2011, Evin et al. 2014, Ye et al. 2014, Dogulu et al. 2015). In this way, we assess the
possible influences imposed by the accuracy of weather forecasts. For the proposed
methodology to be used for forecasting purposes, one needs to run the hydrological
model in forecast mode; i.e. to use temperature and precipitation forecasts, instead of
recorded quantities (Klemeš 1986). In this case, the PDFs of the predictand variables are
predictions. The latter is imposed by the intrinsically uncertain character of the weather
the post-processor assuming no uncertainty in the inputs, and then combine input
uncertainty and post-processing (see e.g. Krzysztofowicz 1999, Pagano et al. 2013).
In this Section, we present the general framework of the proposed methodology. Brief
descriptions of its specific components are given in Section 2.5. We define the interval
score of base-learner n at time t for a prediction interval 1 – a, 0 < a < 1, as (Gneiting and
Raftery 2007):
Ln,t,a(yn,t,a/2, yn,t,(1 – a/2), yt) := (yn,t,(1 – a/2) – yn,t,a/2) + (2/a) (yn,t,a/2 – yt) 1(yt < yn,t,a/2) + (2/a) (yt – yn,t,(1 – a/2)) 1(yt > yn,t,(1 – a/2)) (1)
IS is a proper scoring rule to assess the properties of prediction intervals (see e.g.
Gneiting and Raftery 2007), which traces back to Dunsmore (1968) and Winkler (1972)
(see e.g. Gneiting and Raftery 2007) and has been used to assess the quality of
hydrometeorological forecasts (see e.g. Hamill and Wilks 1995) and hydrological
predictions (see e.g. Bock et al. 2018, Papacharalampous et al. 2019b, c). The reliability
score, which is related to IS, has been used to assess the performance of algorithms for
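In code, the interval score of eq. (1) for a single observation can be computed as follows (an illustrative Python sketch; the study's own computations are in R):

```python
def interval_score(lower, upper, y, a):
    """Interval score (Gneiting and Raftery 2007) of a central (1 - a)
    prediction interval [lower, upper] for observation y.
    Lower values indicate better prediction intervals."""
    width = upper - lower
    if y < lower:                       # observation below the interval
        return width + (2.0 / a) * (lower - y)
    if y > upper:                       # observation above the interval
        return width + (2.0 / a) * (y - upper)
    return width                        # observation inside: width only

# Example: a 90% interval (a = 0.1)
interval_score(2.0, 5.0, 4.0, 0.1)  # inside: score = width = 3.0
interval_score(2.0, 5.0, 6.0, 0.1)  # outside: 3.0 + (2/0.1)*1.0 = 23.0
```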
Also, let t ∈ {1, …, n1 + n2 + n3}, where the period T with available observations has
been divided into three consecutive subperiods T1, T2, and T3 containing n1, n2 and n3
values, respectively. The stacked algorithm is trained in the period {T1, T2}, whereas
period T3 (i.e. an independent period with data not used for training) is used to test the
stacked algorithm. In what follows, we outline the algorithmic steps used to combine the
probabilistic predictions for a specific prediction interval 1 – a (see also Figure 1 for an
illustration):
Step 1 (Train the base-learners in subperiod T1): Each of the n base-learners fn,q(∙), q ∈ {a/2, 1 – a/2}, is trained independently in subperiod T1.
Step 2 (Use the base learners to obtain predictions in subperiod T2): The trained
base-learners of step 1 are used to predict yn,t,q ∀ t ∈ T2, n ∈ N, q ∈ {a/2, 1 – a/2}, where
xt are used as predictor variables of the trained base-learners; i.e. yn,t,q = fn,q(xt).
Step 3 (Stacked generalization): The quantity ∑t Lt,a(yt,a/2, yt,(1 – a/2), yt) is minimized in subperiod T2, where yt,q = ∑n wn,a yn,t,q, q ∈ {a/2, 1 – a/2}, subject to the constraints ∑n wn,a = 1 and wn,a ∈ [0, 1], n ∈ N. The aim is to obtain proper weights wn,a that minimize the total loss over different times t; i.e. ∑t Lt,a(yt,a/2, yt,(1 – a/2), yt).
Step 4 (Retrain the base-learners using the whole training period {T1, T2}): Each of the
n base-learners fn,q(∙), q ∈ {a/2, 1 – a/2}, is trained independently again in the period {T1, T2}.
Step 5 (Obtain predictions in test period T3): The predictive quantile yt,q, q ∈ {a/2, 1 – a/2}, at time t ∈ T3 for a given predictor variable xt, is calculated as yt,q = fe,q(xt), where fe,q denotes the weighted sum (with weights estimated in Step 3) of the quantiles
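Step 3 amounts to a constrained minimization of the total interval score. For n = 2 base-learners (the case used later in this study), the constraints ∑n wn,a = 1 and wn,a ∈ [0, 1] leave a single free weight, so the optimum can be approximated by a dense grid search. A Python sketch (illustrative only; function and variable names are ours, not from the paper):

```python
def interval_score(lower, upper, y, a):
    """Interval score of eq. (1), repeated here so the sketch is self-contained."""
    width = upper - lower
    if y < lower:
        return width + (2.0 / a) * (lower - y)
    if y > upper:
        return width + (2.0 / a) * (y - upper)
    return width

def stack_two_learners(q_lo_1, q_hi_1, q_lo_2, q_hi_2, y_obs, a, n_grid=1001):
    """Step 3 for n = 2 base-learners: find the weight w (learner 1 gets w,
    learner 2 gets 1 - w) minimizing the total interval score of the
    weighted predictive quantiles over subperiod T2.
    q_lo_k / q_hi_k: a/2 and 1 - a/2 quantiles of learner k in T2."""
    best_w, best_loss = 0.0, float("inf")
    for i in range(n_grid):
        w = i / (n_grid - 1)
        loss = sum(
            interval_score(w * l1 + (1 - w) * l2,
                           w * u1 + (1 - w) * u2, y, a)
            for l1, u1, l2, u2, y in zip(q_lo_1, q_hi_1, q_lo_2, q_hi_2, y_obs)
        )
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w, best_loss
```

With more than two base-learners, the same minimization would instead call for a linear-programming or general constrained-optimization solver, as in the CQRA method.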
Figure 1. Illustration of the steps of the proposed algorithm. Green horizontal lines refer
to the periods of the data used.
approaches. In the first approach (termed ensemble learning method 1), for each value of a, steps 1–5 are applied, leading to weight combinations for the base-learners that differ for each prediction interval 1 – a. In the second approach (termed ensemble learning method 2), step 3 is modified to minimize the quantity ∑a ∑t Lt,a(yt,a/2,
yt,(1 – a/2), yt) (i.e. the total loss over several prediction intervals and times), instead of ∑t
Lt,a(yt,a/2, yt,(1 – a/2), yt). Hence, in the second approach, the obtained weight combinations
for the base-learners are invariant with respect to the prediction interval 1 – a, i.e. wn
(a) The simple averaging approach, which assigns equal weights (i.e. in our case ½ for
benchmark, as it corresponds to “an equally weighted opinion pool that is hard to beat in
practice” (see e.g. Lichtendahl Jr. et al. 2013). Simple averaging has been exploited in
quantile predictions (on the order of hundreds) obtained using simulations from a single
(b) Ensemble learners 3 and 4, which correspond to ensemble learners 1 and 2 (see
previous Section) respectively, with the difference that Step 4 of the algorithm (i.e.
retraining of the base learners in period {T1, T2}) is omitted. Thus, prediction is made
using the trained base learners of Step 1. This comparison allows quantification of the
information gain when retraining the base learners in a longer period (i.e. {T1, T2}).
(c) The QR and QRF base learners used to form the ensemble learners.
The proposed algorithm borrows concepts from the fields of hydrology, machine
learning, and statistics. The first basic concept, originating from the field of statistics, is
use of the interval score (IS) defined in eq. (1). Use of IS is substantiated by theoretical
arguments (see e.g. Gneiting and Raftery 2007), with lower values indicating better
component (yn,t,(1 – a/2) – yn,t,a/2) in eq. (1), as well as intervals that do not contain
observations (i.e. through the component of eq. (1) that remains, after subtraction of the
interval width). The latter penalty (hereafter referred to simply as penalty) increases
with the distance of the observations outside the prediction interval and, although more
general, it is implicitly linked to the reliability score (RS), which is defined here for base-
learner n as:
An optimal RS should equal 1 – a; i.e. a fraction 1 – a of the observed values should
averaging the implemented scores over a fixed set of forecasts (see e.g. Gneiting and
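Read as the empirical coverage of the prediction intervals (our interpretation of the verbal description above; the formal definition of RS is not reproduced in this copy), the reliability score can be computed as:

```python
def reliability_score(lower, upper, y_obs):
    """Fraction of observations falling inside their prediction intervals;
    for a well-calibrated central (1 - a) interval this should be
    close to 1 - a."""
    hits = sum(1 for l, u, y in zip(lower, upper, y_obs) if l <= y <= u)
    return hits / len(y_obs)
```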
stacked generalization (see Step 3 above). Stacked generalization (or stacking) is a type
of ensemble learning introduced by Wolpert (1992) (see e.g. Alpaydin 2014, p. 504 for a
“improve” the predictions of the base learners (see e.g. Breiman 1996a, Smyth and
Wolpert 1999), with the latter being used as input. Under this setting, the base learners
and the combiner learner need to be trained over different sets. Here, this is achieved by
splitting the training period into two subperiods T1 and T2. Simultaneous fitting of the
ML algorithms (i.e. Step 1 above) and estimation of the weights (i.e. Steps 2 and 3 above)
using the whole {T1, T2} period is generally not recommended, as ML algorithms tend to overfit the training set, resulting in reduced predictive performance on independent test sets. This has also been verified in the context of the present study,
Other ensemble learning methods also exist; see e.g. the review by Sagi and Rokach
(2018). Two of the most widely used are bagging (Breiman 1996b) and boosting
(Friedman 2001, see also the reviews by Natekin and Knoll 2013, and Mayr et al. 2014).
Bagging averages multiple weak learners (i.e. learners with low performance, or
unstable learners), while in boosting new weak base learners are progressively
introduced and trained to minimize the error of the ensemble learner following an
iterative procedure. Thus, new models are progressively added to the ensemble. Instead,
The overall performance of the ensemble learner (formed by the combiner learner
and the base learners) depends on the ability of the combiner learner to properly weigh the base learners within a given test set, which in turn depends on the effectiveness of its calibration in the training period T2, as well as on potential similarities between periods T1
and T2. The splitting problem of a set into training and validation periods is common to
all areas of hydrology and machine learning, and addressing it goes beyond the scope of
the present study. Here, the training set is partitioned into two subperiods (i.e. T1 and
T2) of almost equal lengths (i.e. 8 and 6 years, respectively), whereas the test set (i.e.
subperiod T3) includes 30% of the available data (i.e. 6 out of 20 years); see Section 3.2
for details. Similar relative lengths for the corresponding training and test periods have
been used in other ML studies; see e.g. Antal et al. (2003), Yu and Xu (2008) and Papacharalampous et al. (2019a). The overall results can be considered reliable as the
length of the available data allows examination of various patterns of low and high
The proposed algorithm borrows concepts from Wang et al. (2019) and Trapero et al.
(2019), who used QS as a loss function, and Yao et al. (2018) who combined closed
expressions of probabilistic forecasts. In the former two studies, the weights of the base-
learners were estimated by minimizing the QS across all targeted quantiles and forecast
horizons. Here, we are interested in estimating optimal prediction intervals (i.e. pairs of predictive quantiles), for which minimization of IS is more suitable than minimization of QS. The latter would lead to doubling the number of the
applied weights (i.e. one weight per bound in QS, vs. one weight per interval in IS), thus
increasing the uncertainty of the resulting predictions. We also note that existing
other methods (e.g., BMA, BLP, NGR; see Introduction), are that: a) the weight search can
and computational efficiency of the algorithm, and b) quantile crossing issues are
minimized, as the obtained weights do not depend on posterior distributions that may
generalization relative to BMA, the reader is referred to Wang et al. 2019). Further
advantages of the method are inherited from the properties of stacked generalization, which is
a general methodology with deep theoretical background (for details see Wolpert 1992),
and the fact that the method is simple, straightforward to use, computationally efficient
(i.e. it takes approximately 45 min to process 511 basins with 30 years of data each, including hydrological model simulations, on a regular PC), and practical due to its full
2.5 Base-learners
General guidelines for the selection of base-learners are presented in Alpaydin (2014,
pp. 488–491). In brief, the base-learners should be simple, accurate, and diverse, so they
complement each other. Here we use QR and QRF as base-learners, but the method can
combine more than two quantile regression base-learners. The ensemble learner can
also include different base-learners, which originate from the same ML algorithm (e.g.
quantile regression algorithms detailing recent progress in the field can be found in
the scope of the present study, brief descriptions of the methods and software packages
Linear-in-parameters quantile regression (QR) was introduced by Koenker and Bassett (1978), while an extended treatment of the method can be found in Koenker (2005). The method models conditional quantiles of the response variable, whereas linear regression considers the conditional mean of the response variable. An intuitive explanation of QR is that it fits a linear model that splits the data so that approximately 100q% of the observations lie below the predicted values of the fitted model. Practically, this is done by
fitting a linear model to the data and minimizing the average QS. The method is suitable
for modelling heteroscedasticity (Koenker 2005, p. 25). We apply the method using the
rq R function of the quantreg R package (Koenker 2018), which implements the fitting
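The average QS minimized when fitting QR can be illustrated in the simplest constant-model case, where the minimizer is an empirical quantile of the data (a Python sketch; the paper instead fits linear models with the rq function of the quantreg R package):

```python
def quantile_score(pred, y, q):
    """Quantile (pinball) score for quantile level q; lower is better."""
    return (q - (1 if y < pred else 0)) * (y - pred)

def fit_constant_quantile(ys, q, candidates=None):
    """Fit a constant model by minimizing the average quantile score over
    candidate values; the minimizer coincides with an empirical
    q-quantile of the data."""
    candidates = candidates if candidates is not None else ys
    return min(candidates,
               key=lambda c: sum(quantile_score(c, y, q) for y in ys))
```

For instance, fitting the 0.5 level recovers a sample median, while levels a/2 and 1 – a/2 give the bounds of a central (1 – a) prediction interval.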
2.5.2 Quantile regression forests
algorithm is based on random forests (RF, Breiman 2001, see also Biau and Scornet
2016), with interest being in conditional quantiles, rather than the conditional mean. RF is a
et al. 2018, Papacharalampous and Tyralis 2018), point time series forecasting in
quantities (see e.g. Tyralis et al. 2018, 2019b). An extensive review on the use of RF in
water sciences can be found in Tyralis et al. (2019a), and a detailed description of the
1996b) regression trees. In addition to bagging, the splitting at the nodes of the
functions of the events exhibiting decision tree outcomes in the test set, lower than a
Here, we apply QRF using the quantile_forest R function of the grf R package
(Tibshirani et al. 2018), which emulates Meinshausen’s (2006) algorithm (see also Athey
et al. 2019). The corresponding algorithm is straightforward and very simple to use,
with a few parameters to tune, while the default values in the software implementation
are near optimal (see e.g. the discussion in Verikas et al. 2011, Oshiro et al. 2012,
Scornet et al. 2015, Biau and Scornet 2016, Probst and Boulesteix 2018, Tyralis et al.
interest is in the relative improvement of the combiner learner with respect to the base-
learners used. Other properties of random forests are that they demonstrate high
predictive performance, they are non-linear and non-parametric, they are fast compared
to other machine learning algorithms, and they are stable and robust to the inclusion of
noisy predictor variables, while they do not extrapolate outside the training range
within the test set (see e.g. Biau and Scornet 2016, Tyralis et al. 2019a).
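The no-extrapolation property mentioned above follows from the fact that QRF reads quantiles off training responses pooled over tree leaves. A deliberately oversimplified toy in Python (single-split "trees"; nothing like the actual grf implementation used in the paper) illustrates the mechanism:

```python
import random

def fit_toy_qrf(xs, ys, n_trees=25, seed=42):
    """Toy quantile 'forest': each 'tree' is a single random split of the
    predictor axis; conditional quantiles are read from the training
    responses pooled over the matching leaves, so predictions can never
    leave the range of the training targets."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        t = rng.choice(xs)  # random split point
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if left and right:  # keep only splits with two non-empty leaves
            trees.append((t, left, right))

    def predict_quantile(x, q):
        pooled = sorted(y for t, left, right in trees
                        for y in (left if x <= t else right))
        return pooled[int(q * (len(pooled) - 1))]

    return predict_quantile
```

Even for a predictor value far outside the training range, the returned quantile is always one of the training responses.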
3.1 Data
A detailed presentation of the CAMELS dataset, used in the present study, can be found in
Addor et al. (2017a, b), Newman et al. (2014, 2015, 2017) and Thornton et al. (2014).
The dataset comprises daily hydrometeorological and streamflow data from 671
small- to medium-sized basins in CONUS. For each basin, the daily minimum and
maximum temperatures and precipitation have been obtained by processing the daily
dataset of Thornton et al. (2014). Changes in the basins due to human influences are
acceptable option; see e.g. Solomatine and Wagener (2011) regarding the requirements
of stationarity when changes cannot be explained deductively. Here we focus on the 34-
year period 1980-2013, and exclude basins with missing data or other inconsistencies.
The final sample consists of 511 basins representing most climate types over CONUS;
see Figure 2.
For each of the 511 basins, we estimate the mean daily temperature as the average of
the respective minimum and maximum daily temperatures. The daily potential evapotranspiration (PET) is estimated following Oudin et al. (2005). For the latter, we use the PEdaily_Oudin R function of the airGR R package
(for details see Coron et al. 2017, 2018), with the daily mean temperature as input.
The GR4J model constitutes an improvement of the GR3J (Génie Rural à 3 paramètres
Journalier) model by Edijatno et al. (1999), and comprises four parameters, while its precursor (i.e. GR3J) comprises three parameters (Perrin et al. 2003). The use of this
small number of parameters is fully justified in Perrin et al. (2001). The hydrological
model is herein calibrated in a non-adaptive way; i.e. the calibration is performed once
for each basin and the hydrological model is thereafter applied with fixed parameter
values (see e.g. Toth et al. 1999). Although feasible, we do not perform adaptive
calibration (see e.g. Brath and Rosso 1993, Ye et al. 2014), as its benefits are delivered
et al. 1999).
We use the airGR R package to apply the GR4J hydrological model to each basin. We
simulate daily streamflow with recorded daily precipitation and PET as input. The
period 1980-1981 is used to warm up the hydrological model, while period 1982-1993
algorithm using the Nash–Sutcliffe criterion (Nash and Sutcliffe 1970), to characterize
Following the notation presented in Section 2.2, we define the periods: T1 = {1994-01-
31}, and use the calibrated hydrological model to simulate daily streamflows for the
total period T = {T1, T2, T3}. The simulated streamflow vt at time t is calculated using
information until day 1993-12-31 for yt (i.e. the recorded streamflow), and until day t for
prt and pett (i.e. precipitation and potential evapotranspiration, respectively). The final
with a total of 1 120 112 simulated values in period T3, where the ensemble learner is
tested (i.e. 2 192 simulated streamflow values for each of the 511 basins).
ahead predictions; see Section 2.1) the predictor variable is defined as xt = (yt – 1, vt). Use
examples (see e.g. Krzysztofowicz 1987, Seo et al. 2006, Evin et al. 2014, Bogner et al.
observations. In Experiment 2, the predictor variable is defined as xt = (vt, vt – 1, vt – 2) and
transformation, with the aim to increase the performance of the model. Appropriate
transformations can be applied to both yt and vt. Several options are available in the
existing literature, such as the arcsinh(∙), log(∙), square root, Box-Cox, and Yeo-Johnson
training sets, and all ML calculations should be performed using transformed quantities.
The inverse transformation is then applied to the predicted quantiles. We tried all
previously mentioned transformations, and found that the square root transformation
was the only one not resulting in unrealistically high quantiles by the QR algorithm in
2018), QR is more robust and less sensitive to the existence of outliers of the dependent
(Díaz-Uriarte and De Andres 2006). The square root transformation has also been used
include heteroscedastic behaviour of the data, censoring (i.e. in case the predicted
effectively model heteroscedastic behaviour, such as the QR used in the present study.
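The transformation workflow (transform predictand and predictors, train in transformed space, back-transform the predicted quantiles) can be sketched for the square-root case as follows; since the square root is monotone, quantiles in the transformed space map directly to quantiles of the original variable (an illustrative Python sketch; function names are ours):

```python
import math

def to_sqrt_space(values):
    """Forward square-root transformation applied to streamflow series
    (both predictand yt and predictor vt) before ML training."""
    return [math.sqrt(x) for x in values]

def from_sqrt_space(quantiles):
    """Inverse transformation applied to the predicted quantiles; valid
    because the square root is monotone, so quantiles map to quantiles."""
    return [z * z for z in quantiles]

# Round trip: back-transformed quantities match the original flows
flows = [0.0, 1.0, 4.0, 9.0]
assert from_sqrt_space(to_sqrt_space(flows)) == flows
```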
Problems of negative quantiles were minimal in the present application. In the case of
QRF base-learners, negative values are by definition not possible, as the predicted
quantiles constitute subsets of the values found in the training set. For the QR base-
learners, the problem of negative quantiles was addressed by censoring them. Quantile
crossing problems were also minimal in the present application, and have been
Wang et al. (2019). According to the latter, if the predicted quantile at level q1 turns out to be larger than the predicted quantile at level q2, with q1 < q2, then the latter is set equal to the former.
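The two fixes just described (censoring negative QR quantiles at zero, and resolving quantile crossing as in Wang et al. 2019) can be sketched as follows (illustrative Python; names are ours):

```python
def censor_negative(quantiles):
    """Censor negative predicted streamflow quantiles at zero."""
    return [max(0.0, q) for q in quantiles]

def fix_crossing(levels_and_quantiles):
    """Resolve quantile crossing: scanning in increasing quantile level,
    if the prediction at a higher level falls below the one at a lower
    level, the higher-level prediction is set to the lower-level one."""
    fixed = []
    running_max = float("-inf")
    for level, q in sorted(levels_and_quantiles):
        running_max = max(running_max, q)  # propagate lower-level value upward
        fixed.append((level, running_max))
    return fixed
```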
For period T3, Figure 3 summarizes information on the simulated and observed
streamflows for all basins analysed. Figure 3.a presents a scatterplot for the same- and
respectively. Regarding Figure 3.b, one sees that the linear regression line (red) between
the same-day hydrological simulations and observations is close to the 45-degree line
(black), indicating that the hydrological model pre-processes the data relatively well.
However, there seems to be a moderate negative bias in the estimation of high flows, as
indicated by the points lying above the 45-degree line. Also, as physically expected, the
deviation between observed and simulated flows increases with increasing lag-times;
Figure 3. Scatterplots of yt versus: (a) yt–1, (b) vt, (c) vt–1, and (d) vt–2 for all basins, and t
in the T3 period. The 45-degree line (black) and the linear regression line (red) between
the variables of the two axes are also presented.
two sequential days (i.e. yt, and yt – 1), indicating the appropriateness of using xt = (yt – 1,
linear regression line in Figure 3.a from the 45-degree line is larger than that in Figure
3.b, indicating the important pre-processing role of the hydrological model. Regarding
high flows, the respective points in Figure 3.a are scattered symmetrically around the
The appropriateness of using xt = (yt – 1, vt) as a predictor variable in hydrological
(obtained for validation period T3 and all considered basins) between yt and (a) yt–1, (b)
vt, (c) vt–1, and (d) vt–2. One sees that the correlations between yt, and each of the
variables yt–1 and vt are generally higher relative to the correlations between the
observed streamflow yt at time t, and the simulated streamflows at earlier times (i.e. vt–1
and vt–2). Correlation histograms obtained for validation period {T1, T2} are similar to
of experiments 1 and 2 at an arbitrary basin. The 0.025 and 0.975 quantiles of the base-
learners QR and QRF, and ensemble learners 1 and 2 are also presented. Visual
inspection of the post-processed simulations indicates that QR, QRF, and ensemble
learners 1 and 2 produce intervals that, in general, include yt. In experiment 2, the
prediction intervals are wider, due to the larger degree of uncertainty induced by the
absence of the previous-day observed streamflow yt–1 as predictor variable. In the next
For brevity, and without loss of generality, in what follows we centre the discussion on
performances relative to experiment 1. For all basins analysed, we assess the predictive
performance of ensemble learners 1–4 and the simple averaging method in period T3.
The assessment is made by estimating the relative improvement (RI) introduced with
respect to each of the base-learners. For instance, the RI of the interval score of learner i with respect to the nth base-learner (used for benchmarking) is
defined as:
Similarly, by substituting the interval score with its components, i.e. interval width and penalty (see Section 2.4.1), one can obtain their relative improvements as well; see
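The relative improvement of a negatively oriented score can be computed as the fractional reduction in the benchmark's score (our reading of the definition; the formula itself is not reproduced in this copy of the text):

```python
def relative_improvement(score_learner, score_benchmark):
    """Relative improvement of a learner over a benchmark for a negatively
    oriented score (lower is better), e.g. the interval score:
    positive values mean the learner beats the benchmark."""
    return (score_benchmark - score_learner) / score_benchmark

relative_improvement(9.0, 10.0)  # 10% improvement -> 0.1
```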
Regarding experiment 1, Figure 6.a shows the mean RI (over all basins) of ensemble
learners 1–4 and simple averaging with respect to QR, for different prediction intervals
1 – a = 20, 40, 60, 80, 90, 95%. Figure 6.b presents similar results to Figure 6.a, but for
experiment 2.
Figure 6. Mean relative improvement (over all basins) of the interval score (IS) with
respect to QR in: (a) experiment 1, and (b) experiment 2, for different prediction
intervals 1 – a = 20, 40, 60, 80, 90, 95%.
A positive value of RI indicates that the examined learner improves over the
benchmark learner. Values equal to 0 indicate that the examined and benchmark
learners 1 and 2 improve more than 10% at prediction intervals below 80%, while the
QRF, the relative improvement is 1-2% at low prediction intervals, and increases to
intervals can be used to predict low and high flows. The diverse properties of the two
base-learners with respect to the magnitude of the prediction interval are also
probable reason for this is that, by construction, QRF cannot predict beyond the range of
observed flows in the training set, whereas the QR algorithm is regression based
approximately 2% at prediction intervals below 80%, with the two methods sharing
Regarding experiment 2 (see Figure 6.b), one sees that the RI curves are shifted
downwards relative to Figure 6.a, indicating lower overall performances associated with
the larger degree of uncertainty (relative to experiment 1) induced by the absence of the
learners 1 and 2 perform better than the base-learners, simple averaging performs as well as both ensemble learners 1 and 2 at all prediction intervals. This important
result indicates that the outcome of optimal weight selection is strongly influenced by
the uncertainty of the predictor-predictand relationship. More precisely, as the level of
may not lead to significant improvements relative to simple averaging; i.e. a uniform
When averaged over all prediction intervals, the relative improvement of the interval
score of ensemble learner 1 in experiment 1 is 8.84% with respect to QR, and 4.43% with respect to QRF. The corresponding improvements introduced by ensemble learner 2 are 8.55% with respect to QR and 4.18% with respect to QRF, and by simple averaging
are 7.90% and 3.60%, respectively. The slight improvement of ensemble learner 1 in
overfitting. Clearly, the two ensemble learners are able to exploit the diverse properties
performance relative to simple averaging by approximately 1%. Also, it follows from the
discussion above that the first ensemble learner is approximately 0.5% more efficient
than the second one. The reason for this is that the first learner uses a combiner
algorithm that allows for additional degrees of freedom, as the weights applied to the
base-learners may vary with the prediction interval 1 – a (see Section 2.2). Note that the
significant, especially due to the size of the test set (i.e. 511 time series, each one
Wang et al. (2019) indicate 4.39% average relative improvement of the quantile score
with respect to the three base-learners used, based on eight daily time series of
electricity consumption, each one consisting of four years of data. Although smaller (due
improvements of ensemble learners 1 and 2 over the base-learners are also observed in
experiment 2 (see Figure 6.b). In addition, both ensemble learners appear to be overall
equivalent to simple averaging, indicating that weight optimization does not lead to
Figure 7 presents histograms of the relative improvements of the IS (see eq. (3))
introduced by the two ensemble learners in experiment 1, for all considered basins,
relative to the two base-learners. Each histogram consists of 3 066 values, which
correspond to six values (i.e. one per prediction interval 1 – a = 20, 40, 60, 80, 90, 95%)
per basin. In all cases, the improvements are mostly positive and well dispersed,
indicating that the results presented in Figure 6 (i.e. the mean relative improvement of
each ensemble learner relative to the two base-learners) are not dominated by
Figure 7. Histograms of relative improvements in terms of IS, as computed for all basins
and prediction intervals in experiment 1, for ensemble learner 1 (left panels) and
ensemble learner 2 (right panels). The relative improvements with respect to quantile
regression (QR) are illustrated in the top panels, and with respect to quantile regression
forests (QRF) in the bottom panels.
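Eq. (3) itself is not reproduced in this excerpt; assuming it is the standard interval score of Gneiting and Raftery (2007), the quantities behind Figures 6 and 7 can be sketched in a few lines of numpy (function and variable names here are illustrative, not taken from the paper's code):

```python
import numpy as np

def interval_score(lower, upper, obs, a):
    """Interval score of Gneiting and Raftery (2007) for a central
    1 - a prediction interval [lower, upper]; lower values are better."""
    lower, upper, obs = map(np.asarray, (lower, upper, obs))
    width = upper - lower
    below = (2.0 / a) * (lower - obs) * (obs < lower)  # penalty when obs falls under the interval
    above = (2.0 / a) * (obs - upper) * (obs > upper)  # penalty when obs falls over the interval
    return float(np.mean(width + below + above))

def relative_improvement(is_learner, is_benchmark):
    """Positive values indicate that the examined learner improves
    over the benchmark learner (cf. the RI of Figures 6 and 7)."""
    return 1.0 - is_learner / is_benchmark
```

For instance, an observation of −0.1 against the 80% interval [0, 1] scores 1 + (2/0.2)·0.1 = 2.0, whereas any observation inside that interval scores only the width, 1.0.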
Figure 8 and Figure 9 present boxplots of the average interval scores (IS) in
experiment 1 and experiment 2, respectively, for period T3. One sees that: a)
in both experiments, ensemble learners 1-4 and simple averaging improve over the
thus confirming that yt–1 (i.e. used as predictor variable in experiment 1) is more
informative than vt–1 and vt–2 combined (i.e. used as predictor variables in experiment 2).
Figure 8. Notched boxplots of average interval scores for experiment 1 in period T3, for
different prediction intervals 1 – a = (a) 20, (b) 40, (c) 60, (d) 80, (e) 90, (f) 95%. The
lower and upper hinges of the boxes correspond to the first and third quartiles. Values
exceeding the third quartile by more than 1.5 times the interquartile range are
considered outliers (denoted by dots).
Figure 9. Notched boxplots of average interval scores for experiment 2 in period T3, for
different prediction intervals 1 – a = (a) 20, (b) 40, (c) 60, (d) 80, (e) 90, (f) 95%. The
lower and upper hinges of the boxes correspond to the first and third quartiles. Values
exceeding the third quartile by more than 1.5 times the interquartile range are
considered outliers (denoted by dots).
To gain further insight regarding the performance of each method in the testing period
T3, Figure 10 presents, for both experiments, the ensemble mean (over all basins) of the
absolute differences between the reliability scores (see Section 2.4.1) and the
Figure 10. Ensemble mean (over all basins) of the absolute differences between the
reliability scores and the corresponding nominal values, for prediction intervals 1 – a =
20, 40, 60, 80, 90, 95%, in: (a) experiment 1, and (b) experiment 2.
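The definition of the reliability score in Section 2.4.1 is likewise outside this excerpt; assuming it is the empirical coverage of the prediction interval, the quantity averaged in Figure 10 (the absolute difference between coverage and nominal level) can be sketched as follows (names are illustrative):

```python
import numpy as np

def coverage(lower, upper, obs):
    """Empirical coverage: fraction of observations inside the interval."""
    lower, upper, obs = map(np.asarray, (lower, upper, obs))
    return float(np.mean((obs >= lower) & (obs <= upper)))

def reliability_deviation(lower, upper, obs, a):
    """Absolute difference between empirical coverage and the nominal
    coverage 1 - a of the prediction interval; smaller is better."""
    return abs(coverage(lower, upper, obs) - (1.0 - a))
```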
One can see that, in experiment 1, QR performs better than QRF at prediction
intervals below 60%, whereas the performances are reversed at higher prediction
intervals. In experiment 2, QR performs better than QRF at all prediction intervals.
In both experiments, ensemble learners 3 and
4 demonstrate limited performance relative to ensemble learners 1 and 2, with the latter
two exhibiting similar performances to simple averaging, balancing those of QR and QRF
base-learners.
Figure 11 presents the median relative improvements (i.e. with respect to QR) in
terms of prediction interval widths. Median values are preferred over mean values to
avoid influences by very low (i.e. near-zero) prediction interval widths. While the
performances of all methods are comparable in experiment 2 (see Figure 11.b) due to
the higher level of uncertainty induced by the absence of the previous-day observed
Figure 11. Median relative improvement (over all basins) of interval widths with respect
to QR in: (a) experiment 1, and (b) experiment 2, for different prediction intervals 1 – a
= 20, 40, 60, 80, 90, 95%.
Figure 12 presents the ensemble mean (over all basins) of the relative improvements
(with respect to QR) of penalties associated with intervals that do not contain
observations (see Section 2.4.1). The general pattern is similar to that of interval scores
in Figure 6, indicating that penalties are an important contributor to the interval score.
Figure 12. Mean relative improvement (over all basins) of penalties with respect to QR
in: (a) experiment 1, and (b) experiment 2, for different prediction intervals 1 – a = 20,
40, 60, 80, 90, 95%.
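Figures 11 and 12 examine the two additive parts of the interval score separately: the interval width, and the penalty incurred when an observation falls outside the interval. Assuming the standard form of the score (Gneiting and Raftery 2007), this decomposition can be sketched as (names illustrative):

```python
import numpy as np

def is_components(lower, upper, obs, a):
    """Split the mean interval score into its mean width and mean penalty
    parts, so that mean interval score = width + penalty."""
    lower, upper, obs = map(np.asarray, (lower, upper, obs))
    width = float(np.mean(upper - lower))
    penalty = float(np.mean((2.0 / a) * ((lower - obs) * (obs < lower)
                                         + (obs - upper) * (obs > upper))))
    return width, penalty
```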
4.3 Weights
To gain insight on how the weights of the ensemble learners (see Section 2.2) are
affected by the performances of the base-learners, Figure 13 shows, for ensemble
learner 1, the weights assigned to the QR base-learner against the relative
improvement of the average interval score of QRF relative to QR, for different prediction
intervals 1 – a. As expected, one sees that, independent of the experiment (i.e.
experiment 1, Figure 13.a, or experiment 2, Figure 13.b), the weights assigned to the
QR base-learner tend to decrease
Figure 13. Scatterplots of the weights of the quantile regression algorithm exploited
through ensemble learner 1 in: (a) experiment 1, and (b) experiment 2, against the
relative improvement of the average interval score of QRF relative to QR in period T3.
Figure 14 presents histograms of the weights assigned to the QR base-learner for
varying prediction intervals 1 – a. When the prediction interval 1 – a increases, the
weights increase as well. This is expected, because the relative gain in performance of
QR over QRF, in terms of the interval score, increases at higher prediction intervals.
Spikes at the edges of the
Figure 14. Histograms of the weights of the quantile regression (QR) algorithm exploited
through ensemble learner 1 in experiment 1 for different prediction intervals 1 – a = (a)
20, (b) 40, (c) 60, (d) 80, (e) 90, (f) 95%.
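With only two base-learners, the stacking weights of ensemble learner 1 reduce to a single scalar w per prediction interval (w on the QR quantiles, 1 − w on the QRF quantiles). A minimal grid-search sketch of such per-interval weight selection — an illustrative stand-in for, not a reproduction of, the estimation scheme of Section 2.2 — could read:

```python
import numpy as np

def interval_score(lower, upper, obs, a):
    # interval score of Gneiting and Raftery (2007); lower is better
    width = upper - lower
    pen = (2.0 / a) * ((lower - obs) * (obs < lower) + (obs - upper) * (obs > upper))
    return np.mean(width + pen)

def best_weight(qr_lo, qr_hi, qrf_lo, qrf_hi, obs, a, grid=101):
    """Return the w in [0, 1] minimizing the interval score of the
    combined quantiles w*QR + (1 - w)*QRF for a given interval 1 - a."""
    ws = np.linspace(0.0, 1.0, grid)
    scores = [interval_score(w * qr_lo + (1 - w) * qrf_lo,
                             w * qr_hi + (1 - w) * qrf_hi, obs, a)
              for w in ws]
    return float(ws[int(np.argmin(scores))])
```

Simple averaging corresponds to fixing w = 0.5, and by construction the selected w can do no worse, on the data it is fitted to, than either pure base-learner (w = 0 or w = 1).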
5. Concluding remarks
predictions. The few existing methods require formal definition of the likelihoods of the
expressions for the PDFs of the obtained forecasts. In this study, we borrowed concepts
from Wang et al. (2019) to propose an ensemble learner, which uses stacked
learners (i.e. quantile regression and quantile regression forests algorithms), using
The method was tested using a large dataset consisting of 511 basins. The conducted
tests focused on delivering one-step ahead predictions (experiment 1), as well as on post-
that the ensemble learners improve over the performance of the best base-learner by 1-
5%, depending on the experiment and the prediction interval. The suggested method
was also found to outperform simple averaging (i.e. a uniform weighting scheme that
assigns equal weights to all base-learners), or to share first place with it in all
examined cases, with the maximum obtained improvement over this tough benchmark
The results are considered significant, especially given the length of the sample on
which the algorithm has been tested (i.e. post-processing of 1 120 112 hydrological predictions
from 511 time series) and the fact that simple averaging is hard to beat in practice (see
e.g. Lichtendahl Jr et al. 2013). The latter general observation indicates that when the
To the best of our knowledge, no similar study has been conducted in the
hydrological literature, with the closest work being that of Wang et al. (2019) in
electricity forecasting. The latter is based on minimization of the quantile score (QS, see
Introduction), indicating 4.39% average relative improvement with respect to the three
minimizing the interval score (IS), resulting, e.g. in experiment 1, in approximately 6.5%
average relative improvement over the two base-learners (i.e. (8.98% + 4.44% + 8.66% +
4.18%)/4; see Section 4.1). Also, note that application of the constrained quantile
regression averaging (CQRA) method of Wang et al. (2019) was based on eight daily
time series of electricity consumption, each one consisting of four years of data.
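As a quick arithmetic check, the approximately 6.5% figure quoted above is the plain average of the four improvement percentages:

```python
# average relative improvement of ensemble learners 1 and 2 over the
# two base-learners in experiment 1 (percentages quoted in the text)
improvements = [8.98, 4.44, 8.66, 4.18]
mean_improvement = sum(improvements) / len(improvements)  # 6.565, i.e. about 6.5%
```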
One should consider the convenience of using the proposed method over other
combination methods (e.g. Bayesian Model Averaging), as well as theoretical studies that
support: a) stacking against Bayesian Model Averaging, and b) working with quantile
forecasts instead of probability distributions (see also Section 1). The extended use of
learning algorithms are accurate, they have been tested extensively in practice as well as
in forecasting competitions, they are easy to apply due to their open software
implementations, and they are efficient in terms of computation times (the
computations of the present study, including fitting of the
large-scale implementations.
Future research could focus on defining optimal splitting points of the training set
used, the inclusion of more base-learners, and testing the method using forecasts of
daily temperature and
when metrics/scores other than IS (see e.g. Gneiting and Raftery 2007 and Shastri et al.
2017) are minimized to optimally combine probabilistic forecasts. Further uses of the
method are also possible, spanning from hydrological forecasting using data-driven
models, to water demand forecasting, to water science problems, and beyond; e.g. in
Conflicts of interest: We declare no conflict of interest.
Acknowledgements: We are grateful to the Editor, Associate Editor, and the reviewers
for their constructive comments and suggestions, which helped us to improve the
manuscript.
Appendix A Nomenclature
Indices
n Index of base-learners
q Index of quantiles
Sets
N Set of base-learners
Q Set of quantiles
Functions
Variables
Ln,t,a Interval score of the n-th base-learner at time t for the 1 – a prediction
interval
Lt,a Interval score of the weighted average of the N methods at time t for the 1 – a
prediction interval
Computations were performed in R (R Core Team 2019) using the following packages: airGR (Coron et al. 2017, 2019),
2017), dplyr (Wickham et al. 2019b), foreach (Microsoft and Weston 2018), gdata
(Warnes et al. 2017), ggplot2 (Wickham 2016; Wickham et al. 2019a), grf (Tibshirani
et al. 2018), knitr (Xie 2014, 2015, 2019), quantreg (Koenker 2018), readr
(Wickham et al. 2018), reshape2 (Wickham 2007, 2017), rmarkdown (Allaire et al.
References
[1] Addor N, Newman AJ, Mizukami N, Clark MP (2017a) Catchment attributes for
large-sample studies. Boulder, CO: UCAR/NCAR.
https://doi.org/10.5065/D6G73C3Q
[2] Addor N, Newman AJ, Mizukami N, Clark MP (2017b) The CAMELS data set:
Catchment attributes and meteorology for large-sample studies. Hydrology and
Earth System Sciences 21:5293–5313. https://doi.org/10.5194/hess-21-5293-
2017
[3] Allaire JJ, Xie Y, McPherson J, Luraschi J, Ushey K, Atkins A, Wickham H, Cheng J,
Chang W, Iannone R (2019) rmarkdown: Dynamic documents for R. R package
version 1.12. https://CRAN.R-project.org/package=rmarkdown
[4] Alpaydin E (2014) Introduction to Machine Learning, 3rd Edition. The MIT Press,
Cambridge, Massachusetts
[5] Antal P, Fannes G, Timmerman D, Moreau Y, de Moor B (2003) Bayesian
applications of belief networks and multilayer perceptrons for ovarian tumor
classification with rejection. Artificial Intelligence in Medicine 29(1–2):39–60.
https://doi.org/10.1016/S0933-3657(03)00053-8
[6] Athey S, Tibshirani J, Wager S (2019) Generalized random forests. The Annals of
Statistics 47(2):1148–1178. https://doi.org/10.1214/18-AOS1709
[7] Baran S, Lerch S (2018) Combining predictive distributions for the statistical
post-processing of ensemble forecasts. International Journal of Forecasting
34(3):477–496. https://doi.org/10.1016/j.ijforecast.2018.01.005
[8] Beck HE, van Dijk AIJM, de Roo A, Dutra E, Fink G, Orth R, Schellekens J (2017)
Global evaluation of runoff from 10 state-of-the-art hydrological models.
Hydrology and Earth System Sciences 21(6):2881–2903.
https://doi.org/10.5194/hess-21-2881-2017
[9] Bentzien S, Friederichs P (2014) Decomposition and graphical portrayal of the
quantile score. Quarterly Journal of the Royal Meteorological Society
140(683):1924–1934. https://doi.org/10.1002/qj.2284
[10] Bhuiyan MAE, Nikolopoulos EI, Anagnostou EN, Quintana-Seguí P, Barella-Ortiz
A (2018) A nonparametric statistical technique for combining global
precipitation datasets: development and hydrological evaluation over the
Iberian Peninsula. Hydrology and Earth System Sciences 22:1371–1389.
https://doi.org/10.5194/hess-22-1371-2018
[11] Biau G, Scornet E (2016) A random forest guided tour. TEST 25(2):197–227.
https://doi.org/10.1007/s11749-016-0481-7
[12] Bock AR, Farmer WH, Hay LE (2018) Quantifying uncertainty in simulated
streamflow and runoff from a continental-scale monthly water balance model.
Advances in Water Resources 122:166–175.
https://doi.org/10.1016/j.advwatres.2018.10.005
[13] Bogner K, Pappenberger F (2011) Multiscale error analysis, correction, and
predictive uncertainty estimation in a flood forecasting system. Water
Resources Research 47(7):W07524. https://doi.org/10.1029/2010WR009137
[14] Bogner K, Pappenberger F, Cloke HL (2012) Technical Note: The normal
quantile transformation and its application in a flood forecasting system.
Hydrology and Earth System Sciences 16:1085–1094.
https://doi.org/10.5194/hess-16-1085-2012
[15] Bogner K, Liechti K, Zappa M (2016) Post-processing of stream flows in
Switzerland with an emphasis on low flows and floods. Water 8(4):115.
https://doi.org/10.3390/w8040115
[16] Bogner K, Liechti K, Zappa M (2017) Technical note: Combining quantile
forecasts and predictive distributions of streamflows. Hydrology and Earth
System Sciences 21:5493–5502. https://doi.org/10.5194/hess-21-5493-2017
[17] Bourgin F, Andréassian V, Perrin C, Oudin L (2015) Transferring global
uncertainty estimates from gauged to ungauged catchments. Hydrology and
Earth System Sciences 19:2535–2546. https://doi.org/10.5194/hess-19-2535-
2015
[18] Box GEP, Jenkins GM, Reinsel GC, Ljung GM (2015) Time Series
Analysis: Forecasting and Control, 5th Edition. John Wiley & Sons, Inc., Hoboken,
New Jersey
[19] Brath A, Rosso R (1993) Adaptive calibration of a conceptual model for flash
flood forecasting. Water Resources Research 29(8):2561–2572.
https://doi.org/10.1029/93WR00665
[20] Breiman L (1996a) Stacked regressions. Machine Learning 24(1):49–64.
https://doi.org/10.1007/BF00117832
[21] Breiman L (1996b) Bagging predictors. Machine Learning 24(2):123–140.
https://doi.org/10.1007/BF00058655
[22] Breiman L (2001) Random forests. Machine Learning 45(1):5–32.
https://doi.org/10.1023/A:1010933404324
[23] Bzdok D, Altman N, Krzywinski M (2018) Statistics versus machine learning.
Nature Methods 15:233–234. https://doi.org/10.1038/nmeth.4642
[24] Coron L, Thirel G, Delaigue O, Perrin C, Andréassian V (2017) The suite of
lumped GR hydrological models in an R package. Environmental Modelling and
Software 94:166–171. https://doi.org/10.1016/j.envsoft.2017.05.002
[25] Coron L, Delaigue O, Thirel G, Perrin C, Michel C (2019) airGR: Suite of GR
hydrological models for precipitation-runoff modelling. R package version
1.2.13.16. https://CRAN.R-project.org/package=airGR
[26] Das T, Bárdossy A, Zehe E, He Y (2008) Comparison of conceptual model
performance using different representations of spatial variability. Journal of
Hydrology 356(1–2):106–118. https://doi.org/10.1016/j.jhydrol.2008.04.008
[27] Díaz-Uriarte R, De Andres SA (2006) Gene selection and classification of
microarray data using random forest. BMC Bioinformatics 7:3.
https://doi.org/10.1186/1471-2105-7-3
[28] Dogulu N, López López P, Solomatine DP, Weerts AH, Shrestha DL (2015)
Estimation of predictive hydrologic uncertainty using the quantile regression
and UNEEC methods and their comparison on contrasting catchments.
Hydrology and Earth System Sciences 19:3181–3201.
https://doi.org/10.5194/hess-19-3181-2015
[29] Dowle M, Srinivasan A (2019) data.table: Extension of 'data.frame'. R package
version 1.12.2. https://CRAN.R-project.org/package=data.table
[30] Dunsmore IR (1968) A Bayesian approach to calibration. Journal of the Royal
Statistical Society. Series B (Methodological) 30(2):396–405.
[31] Edijatno, Nascimento NO, Yang X, Makhlouf Z, Michel C (1999) GR3J: A daily
watershed model with three free parameters. Hydrological Sciences Journal
44(2):263–277. https://doi.org/10.1080/02626669909492221
[32] Evensen G (1994) Sequential data assimilation with a nonlinear
quasi-geostrophic model using Monte Carlo methods to forecast error statistics.
Journal of Geophysical Research 99(C5):10143–10162.
https://doi.org/10.1029/94JC00572
[33] Evin G, Thyer M, Kavetski D, McInerney D, Kuczera G (2014) Comparison of joint
versus postprocessor approaches for hydrological uncertainty estimation
accounting for error autocorrelation and heteroscedasticity. Water Resources
Research 50(3):2350–2375. https://doi.org/10.1002/2013WR014185
[34] Friederichs P, Hense A (2007) Statistical downscaling of extreme precipitation
events using censored quantile regression. Monthly Weather Review 135:2365–
2378. https://doi.org/10.1175/MWR3403.1
[35] Friedman JH (2001) Greedy function approximation: A gradient boosting
machine. The Annals of Statistics 29(5):1189–1232.
https://doi.org/10.1214/aos/1013203451
[36] Gagolewski M (2019) stringi: Character string processing facilities. R package
version 1.4.3. https://CRAN.R-project.org/package=stringi
[37] Gneiting T, Raftery AE (2007) Strictly proper scoring rules, prediction, and
estimation. Journal of the American Statistical Association 102(477):359–378.
https://doi.org/10.1198/016214506000001437
[38] Gneiting T, Ranjan R (2013) Combining predictive distributions. Electronic
Journal of Statistics 7:1747–1782. https://doi.org/10.1214/13-EJS823
[39] Gneiting T, Raftery AE, Westveld AH, Goldman T (2005) Calibrated probabilistic
forecasting using ensemble model output statistics and minimum CRPS
Estimation. Monthly Weather Review 133:1098–1118.
https://doi.org/10.1175/MWR2904.1
[40] Hamill TM, Wilks DS (1995) A Probabilistic forecast contest and the difficulty in
assessing short-range forecast uncertainty. Weather and Forecasting 10:620–
631. https://doi.org/10.1175/1520-0434(1995)010<0620:APFCAT>2.0.CO;2
[41] Hannan EJ, Dunsmuir WTM, Deistler M (1980) Estimation of vector ARMAX
models. Journal of Multivariate Analysis 10(3):275–295.
https://doi.org/10.1016/0047-259X(80)90050-0
[42] Hastie T, Tibshirani R, Friedman J (2009) The Elements of Statistical Learning.
Springer-Verlag New York. https://doi.org/10.1007/978-0-387-84858-7
[43] Hemri S (2018) Chapter 8 - Applications of Postprocessing for Hydrological
Forecasts. In: Vannitsem S, Wilks DS, Messner JW (eds) Statistical
Postprocessing of Ensemble Forecasts. Elsevier, pp 219–240.
https://doi.org/10.1016/B978-0-12-812372-0.00008-X
[44] Hernández-López MR, Francés F (2017) Bayesian joint inference of hydrological
and generalized error models with the enforcement of Total Laws. Hydrology
and Earth System Sciences Discussions. https://doi.org/10.5194/hess-2017-9
[45] Hong T, Pinson P, Fan S, Zareipour H, Troccoli A, Hyndman RJ (2016)
Probabilistic energy forecasting: Global Energy Forecasting Competition 2014
and beyond. International Journal of Forecasting 32(3):896–913.
https://doi.org/10.1016/j.ijforecast.2016.02.001
[46] James G, Witten D, Hastie T, Tibshirani R (2013) An Introduction to Statistical
Learning. Springer-Verlag New York. https://doi.org/10.1007/978-1-4614-
7138-7
[47] Kaleris V, Langousis A (2017) Comparison of two rainfall–runoff models: effects
of conceptualization on water budget components. Hydrological Sciences
Journal 62(5):729–748. https://doi.org/10.1080/02626667.2016.1250899
[48] Kalman RE (1960) A new approach to linear filtering and prediction problems.
Journal of Basic Engineering 82(1):35–45. https://doi.org/10.1115/1.3662552
[49] Kavetski D, Franks SW, Kuczera G (2002) Confronting Input Uncertainty in
Environmental Modelling. In: Duan Q, Gupta HV, Sorooshian S, Rousseau AN,
Turcotte R (eds) Calibration of Watershed Models. AGU, pp 49–68.
https://doi.org/10.1029/WS006p0049
[50] Klemeš V (1986) Operational testing of hydrological simulation models.
Hydrological Sciences Journal 31(1):13–24.
https://doi.org/10.1080/02626668609491024
[51] Koenker RW (2005) Quantile regression. Cambridge University Press,
Cambridge, UK
[52] Koenker RW (2017) Quantile regression: 40 years on. Annual Review of
Economics 9(1):155–176. https://doi.org/10.1146/annurev-economics-
063016-103651
[53] Koenker RW (2018) quantreg: Quantile regression. R package version 5.38.
https://CRAN.R-project.org/package=quantreg
[54] Koenker RW, Bassett Jr G (1978) Regression quantiles. Econometrica 46(1):33–
50. https://doi.org/10.2307/1913643
[55] Koenker RW, D'Orey V (1987) Computing regression quantiles. Journal of the
Royal Statistical Society: Series C (Applied Statistics) 36(3):383–393.
https://doi.org/10.2307/2347802
[56] Koenker RW, D'Orey V (1994) A remark on algorithm AS 229: Computing dual
regression quantiles and regression rank scores. Journal of the Royal Statistical
Society: Series C (Applied Statistics) 43(2):410–414.
https://doi.org/10.2307/2986030
[57] Koenker RW, Machado JAF (1999) Goodness of fit and related inference
processes for quantile regression. Journal of the American Statistical
Association 94(448):1296–1310.
https://doi.org/10.1080/01621459.1999.10473882
[58] Koutsoyiannis D, Montanari A (2015) Negligent killing of scientific concepts: the
stationarity case. Hydrological Sciences Journal 60(7–8):1174–1183.
https://doi.org/10.1080/02626667.2014.959959
[59] Krzysztofowicz R (1987) Markovian forecast processes. Journal of the American
Statistical Association 82(397):31–37.
https://doi.org/10.1080/01621459.1987.10478387
[60] Krzysztofowicz R (1997) Transformation and normalization of variates with
specified distributions. Journal of Hydrology 197(1–4):286–292.
https://doi.org/10.1016/S0022-1694(96)03276-3
[61] Krzysztofowicz R (1999) Bayesian theory of probabilistic forecasting via
deterministic hydrologic model. Water Resources Research 35(9):2739–2750.
https://doi.org/10.1029/1999WR900099
[62] Krzysztofowicz R (2001) The case for probabilistic forecasting in hydrology.
Journal of Hydrology 249(1–4):2–9. https://doi.org/10.1016/S0022-
1694(01)00420-6
[63] Krzysztofowicz R (2002) Bayesian system for probabilistic river stage
forecasting. Journal of Hydrology 268:16–40. https://doi.org/10.1016/S0022-
1694(02)00106-3
[64] Krzysztofowicz R, Kelly KS (2000) Hydrologic uncertainty processor for
probabilistic river stage forecasting. Water Resources Research 36:3265–3277.
https://doi.org/10.1029/2000WR900108
[65] Kuczera G, Kavetski D, Franks S, Thyer M (2006) Towards a Bayesian total error
analysis of conceptual rainfall-runoff models: Characterising model error using
storm-dependent parameters. Journal of Hydrology 331(1–2):161–177.
https://doi.org/10.1016/j.jhydrol.2006.05.010
[66] Langousis A, Mamalakis A, Puliga M, Deida R (2016) Threshold detection for the
generalized Pareto distribution: Review of representative methods and
application to the NOAA NCDC daily rainfall database. Water Resources
Research 52(4):2659–2681. https://doi.org/10.1002/2015WR018502
[67] Li W, Duan Q, Miao C, Ye A, Gong W, Di Z (2017) A review on statistical
postprocessing methods for hydrometeorological ensemble forecasting. Wiley
Interdisciplinary Reviews: Water 4(6):e1246.
https://doi.org/10.1002/wat2.1246
[68] Lichtendahl Jr KC, Grushka-Cockayne Y, Winkler RL (2013) Is it better to
average probabilities or quantiles?. Management Science 59(7):1594–1611.
https://doi.org/10.1287/mnsc.1120.1667
[69] Lidén R, Harlin J (2000) Analysis of conceptual rainfall–runoff modelling
performance in different climates. Journal of Hydrology 238(3–4):231–247.
https://doi.org/10.1016/S0022-1694(00)00330-9
[70] López López P, Verkade JS, Weerts AH, Solomatine DP (2014) Alternative
configurations of quantile regression for estimating predictive uncertainty in
water level forecasts for the upper Severn River: a comparison. Hydrology and
Earth System Sciences 18:3411–3428. https://doi.org/10.5194/hess-18-3411-
2014
[71] Mayr A, Binder H, Gefeller O, Schmid M (2014) The evolution of boosting
algorithms. Methods of Information in Medicine 53(06):419–427.
https://doi.org/10.3414/ME13-01-0122
[72] Meinshausen N (2006) Quantile regression forests. Journal of Machine Learning
Research 7:983–999
[73] Messner JW (2018) Chapter 11 - Ensemble Postprocessing With R. In:
Vannitsem S, Wilks DS, Messner JW (eds) Statistical Postprocessing of Ensemble
Forecasts. Elsevier, pp 291–329. https://doi.org/10.1016/B978-0-12-812372-
0.00011-X
[74] Michel C (1991) Hydrologie appliquée aux petits bassins ruraux. Cemagref,
Antony, France
[75] Microsoft, Weston S (2017) foreach: Provides foreach looping construct for R. R
package version 1.4.4. https://CRAN.R-project.org/package=foreach
[76] Microsoft Corporation, Weston S (2018) doParallel: Foreach parallel adaptor for
the 'parallel' package. R package version 1.0.14. https://CRAN.R-
project.org/package=doParallel
[77] Min C, Zellner A (1993) Bayesian and non-Bayesian methods for combining
models and forecasts with applications to forecasting international growth
rates. Journal of Econometrics 56(1–2):89–118. https://doi.org/10.1016/0304-
4076(93)90102-B
[78] Montanari A (2011) 2.17 - Uncertainty of Hydrological Predictions. In: Wilderer
P (ed) Treatise on Water Science. Elsevier, pp 459–478.
https://doi.org/10.1016/B978-0-444-53199-5.00045-2
[79] Montanari A, Brath A (2004) A stochastic approach for assessing the
uncertainty of rainfall-runoff simulations. Water Resources Research
40(1):W01106. https://doi.org/10.1029/2003WR002540
[80] Montanari A, Grossi G (2008) Estimating the uncertainty of hydrological
forecasts: A statistical approach. Water Resources Research 44(12):W00B08.
https://doi.org/10.1029/2008WR006897
[81] Montanari A, Koutsoyiannis D (2012) A blueprint for process-based modeling of
uncertain hydrological systems. Water Resources Research 48(9):W09555.
https://doi.org/10.1029/2011WR011412
[82] Mouelhi S, Michel C, Perrin C, Andréassian V (2006a) Stepwise development of a
two-parameter monthly water balance model. Journal of Hydrology 318(1–
4):200–214. https://doi.org/10.1016/j.jhydrol.2005.06.014
[83] Mouelhi S, Michel C, Perrin C, Andréassian V (2006b) Linking stream flow to
rainfall at the annual time step: the Manabe bucket model revisited. Journal of
Hydrology 328(1–2):283–296. https://doi.org/10.1016/j.jhydrol.2005.12.022
[84] Nash JE, Sutcliffe JV (1970) River flow forecasting through conceptual models
part I — A discussion of principles. Journal of Hydrology 10(3):282–290.
https://doi.org/10.1016/0022-1694(70)90255-6
[85] Natekin A, Knoll A (2013) Gradient boosting machines, a tutorial. Frontiers in
Neurorobotics 7:21. https://doi.org/10.3389/fnbot.2013.00021
[86] Newman AJ, Sampson K, Clark MP, Bock A, Viger RJ, Blodgett D (2014) A large-
sample watershed-scale hydrometeorological dataset for the contiguous USA.
Boulder, CO: UCAR/NCAR. https://doi.org/10.5065/D6MW2F4D
[87] Newman AJ, Clark MP, Sampson K, Wood A, Hay LE, Bock A, Viger RJ, Blodgett D,
Brekke L, Arnold JR, Hopson T, Duan Q (2015) Development of a large-sample
watershed-scale hydrometeorological data set for the contiguous USA: data set
characteristics and assessment of regional variability in hydrologic model
performance. Hydrology and Earth System Sciences 19:209–223.
https://doi.org/10.5194/hess-19-209-2015
[88] Newman AJ, Mizukami N, Clark MP, Wood AW, Nijssen B, Nearing G (2017)
Benchmarking of a physically based hydrologic model. Journal of
Hydrometeorology 18:2215–2225. https://doi.org/10.1175/JHM-D-16-0284.1
[89] Nikolopoulos EI, Destro E, Bhuiyan MAE, Borga M, Anagnostou EN (2018)
Evaluation of predictive models for post-fire debris flow occurrence in the
western United States. Natural Hazards and Earth System Sciences 18:2331–
2343. https://doi.org/10.5194/nhess-18-2331-2018
[90] Oshiro TM, Perez PS, Baranauskas JA (2012) How many trees in a random
forest?. In: Perner P (ed) Machine Learning and Data Mining in Pattern
Recognition (Lecture Notes in Computer Science). Springer-Verlag Berlin
Heidelberg, IBaI, Leipzig, Germany, 2012; Volume 7376, pp 154–168.
https://doi.org/10.1007/978-3-642-31537-4
[91] Ouali D, Chebana F, Ouarda TBMJ (2016) Quantile regression in regional
frequency analysis: A better exploitation of the available information. Journal of
Hydrometeorology 17:1869–1883. https://doi.org/10.1175/JHM-D-15-0187.1
[92] Oudin L, Hervieu F, Michel C, Perrin C, Andréassian V, Anctil F, Loumagne C
(2005) Which potential evapotranspiration input for a lumped rainfall–runoff
model?: Part 2—Towards a simple and efficient potential evapotranspiration
model for rainfall–runoff modelling. Journal of Hydrology 303(1–4):290–306.
https://doi.org/10.1016/j.jhydrol.2004.08.026
[93] Pagano TC, Shrestha DL, Wang QJ, Robertson D, Hapuarachchi P (2013)
Ensemble dressing for hydrological applications. Hydrological Processes
27(1):106–116. https://doi.org/10.1002/hyp.9313
[94] Papacharalampous G, Tyralis H (2018) Evaluation of random forests and
Prophet for daily streamflow forecasting. Advances in Geosciences 45:201–208.
https://doi.org/10.5194/adgeo-45-201-2018
[95] Papacharalampous G, Tyralis H, Koutsoyiannis D (2018a) One-step ahead
forecasting of geophysical processes within a purely statistical framework.
Geoscience Letters 5(12). https://doi.org/10.1186/s40562-018-0111-1
[96] Papacharalampous G, Tyralis H, Koutsoyiannis D (2018b) Predictability of
monthly temperature and precipitation using automatic time series forecasting
methods. Acta Geophysica 66(4):807–831. https://doi.org/10.1007/s11600-
018-0120-7
[97] Papacharalampous G, Tyralis H, Koutsoyiannis D (2018c) Univariate time series
forecasting of temperature and precipitation with a focus on machine learning
algorithms: A multiple-case study from Greece. Water Resources Management
32(15):5207–5239. https://doi.org/10.1007/s11269-018-2155-6
[98] Papacharalampous G, Tyralis H, Koutsoyiannis D (2019a) Comparison of
stochastic and machine learning methods for multi-step ahead forecasting of
hydrological processes. Stochastic Environmental Research and Risk
Assessment 33(2):481–514. https://doi.org/10.1007/s00477-018-1638-6
[99] Papacharalampous G, Koutsoyiannis D, Montanari A (2019b) Quantification of
predictive uncertainty in hydrological modelling by harnessing the wisdom of
the crowd: Methodology development and investigation using toy models.
https://doi.org/10.13140/RG.2.2.32868.22401
[100] Papacharalampous G, Tyralis H, Koutsoyiannis D, Montanari A (2019c)
Quantification of predictive uncertainty in hydrological modelling by harnessing
the wisdom of the crowd: A large–sample experiment at monthly timescale.
https://doi.org/10.13140/RG.2.2.16091.00801
[101] Perrin C, Michel C, Andréassian V (2001) Does a large number of parameters
enhance model performance? Comparative assessment of common catchment
model structures on 429 catchments. Journal of Hydrology 242(3–4):275–301.
https://doi.org/10.1016/S0022-1694(00)00393-0
[102] Perrin C, Michel C, Andréassian V (2003) Improvement of a parsimonious model
for streamflow simulation. Journal of Hydrology 279(1–4):275–289.
https://doi.org/10.1016/S0022-1694(03)00225-7
[103] Peterson RA (2018) bestNormalize: Normalizing transformation functions. R
package version 1.3.0. https://CRAN.R-project.org/package=bestNormalize
[104] Probst P, Boulesteix AL (2018) To tune or not to tune the number of trees in
random forest. Journal of Machine Learning Research 18(181):1–18
[105] R Core Team (2019) R: A language and environment for statistical computing. R
Foundation for Statistical Computing, Vienna, Austria. https://www.R-
project.org/
[106] Raftery AE, Madigan D, Hoeting JA (1997) Bayesian model averaging for linear
regression models. Journal of the American Statistical Association 92(437):179–
191. https://doi.org/10.1080/01621459.1997.10473615
[107] Raftery AE, Gneiting T, Balabdaoui F, Polakowski M (2005) Using Bayesian
model averaging to calibrate forecast ensembles. Monthly Weather Review
133:1155–1174. https://doi.org/10.1175/MWR2906.1
[108] Ranjan R, Gneiting T (2010) Combining probability forecasts. Journal of the
Royal Statistical Society: Series B (Statistical Methodology) 72(1):71–91.
https://doi.org/10.1111/j.1467-9868.2009.00726.x
[109] Reinsel G (1979) Maximum likelihood estimation of stochastic linear difference
equations with autoregressive moving average errors. Econometrica
47(1):129–151. https://doi.org/10.2307/1912351
[110] Rigby RA, Stasinopoulos DM (2005) Generalized additive models for location,
scale and shape. Journal of the Royal Statistical Society: Series C (Applied
Statistics) 54(3):507–554. https://doi.org/10.1111/j.1467-9876.2005.00510.x
[111] Sagi O, Rokach L (2018) Ensemble learning: A survey. Wiley Interdisciplinary
Reviews: Data Mining and Knowledge Discovery 8(4):e1249.
https://doi.org/10.1002/widm.1249
[112] Scornet E, Biau G, Vert JP (2015) Consistency of random forests. The Annals of
Statistics 43(4):1716–1741. https://doi.org/10.1214/15-AOS1321
[113] Seo DJ, Herr HD, Schaake JC (2006) A statistical post-processor for accounting of
hydrologic uncertainty in short-range ensemble streamflow prediction.
Hydrology and Earth System Sciences Discussions 3:1987–2035.
https://doi.org/10.5194/hessd-3-1987-2006
[114] Shastri H, Ghosh S, Karmakar S (2017) Improving global forecast system of
extreme precipitation events with regional statistical model: Application of
quantile-based probabilistic forecasts. Journal of Geophysical Research:
Atmospheres 122(3):1617–1634. https://doi.org/10.1002/2016JD025489
[115] Smyth P, Wolpert D (1999) Linearly combining density estimators via stacking.
Machine Learning 36(1–2):59–83. https://doi.org/10.1023/A:1007511322260
[116] Solomatine DP, Wagener T (2011) 2.16 - Hydrological Modeling. In: Wilderer P
(ed) Treatise on Water Science. Elsevier, pp 435–457.
https://doi.org/10.1016/B978-0-444-53199-5.00044-0
[117] Taillardat M, Mestre O, Zamo M, Naveau P (2016) Calibrated ensemble forecasts
using quantile regression forests and ensemble model output statistics. Monthly
Weather Review 144:2375–2393. https://doi.org/10.1175/MWR-D-15-0260.1
[118] Taylor JW (2000) A quantile regression neural network approach to estimating
the conditional density of multiperiod returns. Journal of Forecasting
19(4):299–311. https://doi.org/10.1002/1099-131X(200007)19:4<299::AID-
FOR775>3.0.CO;2-V
[119] Thornton PE, Thornton MM, Mayer BW, Wilhelmi N, Wei Y, Devarakonda R,
Cook RB (2014) Daymet: Daily surface weather data on a 1-km grid for North
America, version 2. ORNL DAAC, Oak Ridge, Tennessee, USA. Date accessed:
2016/01/20. https://doi.org/10.3334/ORNLDAAC/1219
[120] Tibshirani J, Athey S, Wager S (2018) grf: Generalized random forests (beta). R
package version 0.10.2. https://CRAN.R-project.org/package=grf
[121] Todini E (2007) Hydrological catchment modelling: Past, present and future.
Hydrology and Earth System Sciences 11:468–482.
https://doi.org/10.5194/hess-11-468-2007
[122] Toth E, Montanari A, Brath A (1999) Real-time flood forecasting via combined
use of conceptual and stochastic models. Physics and Chemistry of the Earth,
Part B: Hydrology, Oceans and Atmosphere 24(7):793–798.
https://doi.org/10.1016/S1464-1909(99)00082-9
[123] Trapero JR, Cardós M, Kourentzes N (2019) Quantile forecast optimal
combination to enhance safety stock estimation. International Journal of
Forecasting 35(1):239–250. https://doi.org/10.1016/j.ijforecast.2018.05.009
[124] Tyralis H, Koutsoyiannis D (2014) A Bayesian statistical model for deriving the
predictive distribution of hydroclimatic variables. Climate Dynamics 42(11–
12):2867–2883. https://doi.org/10.1007/s00382-013-1804-y
[125] Tyralis H, Koutsoyiannis D (2017) On the prediction of persistent processes
using the output of deterministic models. Hydrological Sciences Journal
62(13):2083–2102. https://doi.org/10.1080/02626667.2017.1361535
[126] Tyralis H, Papacharalampous G (2017) Variable selection in time series
forecasting using random forests. Algorithms 10(4):114.
https://doi.org/10.3390/a10040114
[127] Tyralis H, Papacharalampous G (2018) Large-scale assessment of Prophet for
multi-step ahead forecasting of monthly streamflow. Advances in Geosciences
45:147–153. https://doi.org/10.5194/adgeo-45-147-2018
[128] Tyralis H, Dimitriadis P, Koutsoyiannis D, O'Connell PE, Tzouka K, Iliopoulou T
(2018) On the long-range dependence properties of annual precipitation using a
global network of instrumental measurements. Advances in Water Resources
111:301–318. https://doi.org/10.1016/j.advwatres.2017.11.010
[129] Tyralis H, Papacharalampous G, Langousis A (2019a) A brief review of random
forests for water scientists and practitioners and their recent history in water
resources. Water 11(5):910. https://doi.org/10.3390/w11050910
[130] Tyralis H, Papacharalampous G, Tantanee S (2019b) How to explain and predict
the shape parameter of the generalized extreme value distribution of
streamflow extremes using a big dataset. Journal of Hydrology 574:628–645.
https://doi.org/10.1016/j.jhydrol.2019.04.070
[131] Verikas A, Gelzinis A, Bacauskiene M (2011) Mining data with random forests: A
survey and results of new tests. Pattern Recognition 44(2):330–349.
https://doi.org/10.1016/j.patcog.2010.08.011
[132] Vrugt JA, Robinson BA (2007) Treatment of uncertainty using ensemble
methods: Comparison of sequential data assimilation and Bayesian model
averaging. Water Resources Research 43(1):W01411.
https://doi.org/10.1029/2005WR004838
[133] Waldmann E (2018) Quantile regression: A short story on how and why.
Statistical Modelling 18(3–4):203–218.
https://doi.org/10.1177/1471082X18759142
[134] Wang Y, Zhang N, Tan Y, Hong T, Kirschen DS, Kang C (2019) Combining
probabilistic load forecasts. IEEE Transactions on Smart Grid 10(4):3664–3674.
https://doi.org/10.1109/TSG.2018.2833869
[135] Warnes GR, Bolker B, Gorjanc G, Grothendieck G, Korosec A, Lumley T,
MacQueen D, Magnusson A, Rogers J (2017) gdata: Various R programming
tools for data manipulation. R package version 2.18.0. https://CRAN.R-
project.org/package=gdata
[136] Weerts AH, Winsemius HC, Verkade JS (2011) Estimation of predictive
hydrological uncertainty using quantile regression: Examples from the national
flood forecasting system (England and Wales). Hydrology and Earth System
Sciences 15:255–265. https://doi.org/10.5194/hess-15-255-2011
[137] Weijs SV, Schoups G, Van de Giesen N (2010) Why hydrological predictions
should be evaluated using information theory. Hydrology and Earth System
Sciences 14:2545–2558. https://doi.org/10.5194/hess-14-2545-2010
[138] Wickham H (2007) Reshaping data with the reshape package. Journal of
Statistical Software 21(12). https://doi.org/10.18637/jss.v021.i12
[139] Wickham H (2016) ggplot2. Springer-Verlag New York.
https://doi.org/10.1007/978-0-387-98141-3
[140] Wickham H (2017) reshape2: Flexibly reshape data: A reboot of the reshape
package. R package version 1.4.3. https://CRAN.R-
project.org/package=reshape2
[141] Wickham H (2019) stringr: Simple, consistent wrappers for common string
operations. R package version 1.4.0. https://CRAN.R-
project.org/package=stringr
[142] Wickham H, Hester J, Francois R (2018) readr: Read rectangular text data. R
package version 1.3.1. https://CRAN.R-project.org/package=readr
[143] Wickham H, Chang W, Henry L, Pedersen TL, Takahashi K, Wilke C, Woo K
(2019a) ggplot2: Create elegant data visualisations using the grammar of
graphics. R package version 3.1.1. https://CRAN.R-project.org/package=ggplot2
[144] Wickham H, François R, Henry L, Müller K (2019b) dplyr: A grammar of data
manipulation. R package version 0.8.0.1. https://CRAN.R-
project.org/package=dplyr
[145] Wickham H, Hester J, Chang W (2019c) devtools: Tools to make developing R
packages easier. R package version 2.0.2. https://CRAN.R-
project.org/package=devtools
[146] Winkler RL (1972) A decision-theoretic approach to interval estimation. Journal
of the American Statistical Association 67(337):187–191.
https://doi.org/10.1080/01621459.1972.10481224
[147] Wolpert DH (1992) Stacked generalization. Neural Networks 5(2):241–259.
https://doi.org/10.1016/S0893-6080(05)80023-1
[148] Xie Y (2014) knitr: A comprehensive tool for reproducible research in R. In:
Stodden V, Leisch F, Peng RD (eds) Implementing Reproducible Computational
Research. Chapman and Hall/CRC
[149] Xie Y (2015) Dynamic Documents with R and knitr, 2nd edn. Chapman and
Hall/CRC
[150] Xie Y (2019) knitr: A general-purpose package for dynamic report generation in
R. R package version 1.22. https://CRAN.R-project.org/package=knitr
[151] Xu L, Chen N, Zhang X, Chen Z (2018) An evaluation of statistical, NMME and
hybrid models for drought prediction in China. Journal of Hydrology 566:235–
249. https://doi.org/10.1016/j.jhydrol.2018.09.020
[152] Yan J, Liao GY, Gebremichael M, Shedd R, Vallee DR (2014) Characterizing the
uncertainty in river stage forecasts conditional on point forecast values. Water
Resources Research 48(12):W12509. https://doi.org/10.1029/2012WR011818
[153] Yao Y, Vehtari A, Simpson D, Gelman A (2018) Using stacking to average
Bayesian predictive distributions. Bayesian Analysis 13(3):917–1003.
https://doi.org/10.1214/17-BA1091
[154] Ye A, Duan Q, Yuan X, Wood EF, Schaake J (2014) Hydrologic post-processing of
MOPEX streamflow simulations. Journal of Hydrology 508:147–156.
https://doi.org/10.1016/j.jhydrol.2013.10.055
[155] Yu B, Xu Z (2008) A comparative study for content-based dynamic spam
classification using four machine learning algorithms. Knowledge-Based
Systems 21(4):355–362. https://doi.org/10.1016/j.knosys.2008.01.001
[156] Zhao L, Duan Q, Schaake J, Ye A, Xia J (2011) A hydrologic post-processor for
ensemble streamflow predictions. Advances in Geosciences 29:51–59.
https://doi.org/10.5194/adgeo-29-51-2011
Conflicts of interest: The authors declare no conflict of interest.
Highlights