
Accepted Manuscript

Research papers

Hydrological post-processing using stacked generalization of quantile regression algorithms: Large-scale application over CONUS

Hristos Tyralis, Georgia Papacharalampous, Apostolos Burnetas, Andreas Langousis

PII: S0022-1694(19)30677-8
DOI: https://doi.org/10.1016/j.jhydrol.2019.123957
Article Number: 123957
Reference: HYDROL 123957

To appear in: Journal of Hydrology

Received Date: 6 January 2019


Revised Date: 5 July 2019
Accepted Date: 13 July 2019

Please cite this article as: Tyralis, H., Papacharalampous, G., Burnetas, A., Langousis, A., Hydrological post-processing using stacked generalization of quantile regression algorithms: Large-scale application over CONUS, Journal of Hydrology (2019), doi: https://doi.org/10.1016/j.jhydrol.2019.123957

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers
we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and
review of the resulting proof before it is published in its final form. Please note that during the production process
errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Hydrological post-processing using stacked generalization of quantile
regression algorithms: Large-scale application over CONUS

Hristos Tyralis1, Georgia Papacharalampous2, Apostolos Burnetas3, and Andreas Langousis4
1Air Force Support Command, Hellenic Air Force, Elefsina Air Base, 192 00 Elefsina,
Greece (https://orcid.org/0000-0002-8932-4997)
2Department of Water Resources and Environmental Engineering, School of Civil
Engineering, National Technical University of Athens, Iroon Polytechniou 5, 157 80
Zografou, Greece (https://orcid.org/0000-0001-5446-954X)
3Department of Mathematics, School of Science, National and Kapodistrian University of
Athens, Panepistemiopolis, 157 84 Athens, Greece (https://orcid.org/0000-0002-9365-
9255)
4Department of Civil Engineering, School of Engineering, University of Patras, University
Campus, Rio, 26 504, Patras, Greece (https://orcid.org/0000-0002-0643-2520)

Corresponding author: Hristos Tyralis (montchrister@gmail.com)

Abstract: Post-processing of hydrological model simulations using machine learning algorithms can be applied to quantify the uncertainty of hydrological predictions.
Combining multiple diverse machine learning algorithms (referred to as base-learners)
using stacked generalization (stacking, i.e. a type of ensemble learning) is considered to
improve predictions relative to the base-learners. Here we propose stacking of quantile
regression and quantile regression forests. Stacking is performed by minimising the
interval score of the quantile predictions provided by the ensemble learner, which is a
linear combination of quantile regression and quantile regression forests. The proposed
ensemble learner post-processes simulations of the GR4J hydrological model for 511
basins in the contiguous US. We illustrate its significantly improved performance
relative to the base-learners used and a less prominent improvement relative to the
“hard to beat in practice” equal-weight combiner.

Keywords: combining probabilistic forecasts; ensemble learning; hydrological uncertainty; interval score; quantile regression; quantile regression forests
1. Introduction

An important objective of hydrological models is to predict a variable of interest (e.g.

river discharge or runoff volume), usually referred to as predictand (using the

terminology of Krzysztofowicz 1999), as a response to other hydrological variables

(temperature, precipitation etc.; see e.g. Lidén and Harlin 2000, Mouelhi et al. 2006a, b,

Das et al. 2008, Kaleris and Langousis 2017). In this context, hydrological models can be

classified into three broad categories; i.e. physically based, conceptual, and data-driven

(see e.g. Solomatine and Wagener 2011). The output of the physically based and

conceptual models is point predictions of hydrologic quantities, which do not allow for

direct quantification of predictive uncertainties. To account for the latter, within the

general framework of probabilistic prediction (see e.g. Krzysztofowicz and Kelly 2000,

Krzysztofowicz 2001, 2002, Kavetski et al. 2002, Montanari and Brath 2004, Kuczera et

al. 2006, Todini 2007, Montanari and Grossi 2008, Weijs et al. 2010, Montanari and

Koutsoyiannis 2012, Hernández-López and Francés 2017, Tyralis and Koutsoyiannis

2017), one needs to estimate the probability distribution function (PDF) of the

predictand variable (or the joint probability distribution function of all predictand

variables of interest), corresponding to different uncertainty sources; see e.g. the

detailed review on global uncertainty estimation by Montanari (2011). One way to do so

is to post-process hydrological model outputs using conditional distribution-based

models, regression-based methods, or other algorithmic approaches (see e.g. Li et al.

2017).

Here we are interested in the case where post-processing of hydrological predictions

is conducted using quantile regression-based models; for a detailed review of the

general framework of regression schemes in the context of hydrometeorological post-processing, the reader is referred to Li et al. (2017). Examples of relevant algorithms

include (see e.g. Messner 2018 for a detailed list):

(a) Quantile regression (see e.g. Koenker and Bassett Jr 1978, and Koenker 2005 on

the methodological framework, Ouali et al. 2016 for an application on regional

frequency analysis in hydrology, and Weerts et al. 2011, López López et al. 2014, Dogulu

et al. 2015 for applications on hydrological post-processing, and Section 2.5.1).

(b) Quantile regression neural networks (QRNN, where Artificial Neural Networks

are used to quantify the relationship between predictor variables and conditional

quantiles of dependent variables, see e.g. Taylor 2000, and Bogner et al. 2016 for an

application on flood forecasting systems).

Within the broader class of regression schemes, one can also consider:

(a) Autoregressive models with exogenous variables (ARX, see e.g. Reinsel 1979,

Hannan et al. 1980, Box et al. 2015, and Seo et al. 2006 for an application).

(b) Vector autoregressive models with exogenous variables (VARX, see e.g. Hannan et

al. 1980 on the methodological framework, and Bogner and Pappenberger 2011 for an

application).

(c) Use of ensemble Kalman filtering techniques (see e.g. Kalman 1960, Evensen 1994,

and Vrugt and Robinson 2007 for an application).

(d) Generalized additive models for location, scale and shape (GAMLSS, where the distribution parameters of

dependent variables are modelled using regression algorithms, see e.g. Rigby and

Stasinopoulos 2005, and Yan et al. 2014 for an application on river storage forecasts).

In an effort to improve the accuracy of hydrological predictions, methods to combine

probabilistic forecasts originating from the application of algorithmic schemes to the

outputs of hydrological models (hereafter referred to as base-learners; see e.g. Alpaydin 2014, p. 487) have started gaining prominence. These include Bayesian Model Averaging

(BMA, see e.g. Min and Zellner 1993, Raftery et al. 1997, 2005), non-homogenous

Gaussian regression (NGR, see e.g. Gneiting et al. 2005) and the beta-transformed linear

pool (BLP, see e.g. Ranjan and Gneiting 2010, Gneiting and Ranjan 2013), among others;

see e.g. the reviews in Bogner et al. (2017), Baran and Lerch (2018) and Wang et al.

(2019).

Most regression models belong to the families of Statistical Learning (SL, see e.g.

Hastie et al. 2009; James et al. 2013) or Machine Learning (ML) algorithms, with the

distinction between the two terms being primarily a matter of scientific debate (see e.g.

Bzdok et al. 2018). For brevity, in what follows, we use the term machine learning (ML)

for the algorithms and general methodological framework, and skip the alternative term.

Machine learning algorithms belong to the class of nonparametric methods, thus not

providing explicit expressions for the PDFs of the obtained forecasts. The latter need to

be estimated independently in the context of each specific application and hydrological

model used, to be properly combined using methods such as BMA, BLP, NGR etc. (see

above).

Recognizing the need to combine probabilistic predictions without obtaining explicit

expressions for the PDFs of the base-learners, Wang et al. (2019) proposed the

constrained quantile regression averaging (CQRA) method to directly combine quantile

forecasts and predict electricity demand. CQRA is based on the minimization of the

quantile score (QS, see e.g. Koenker and Machado 1999, Friederichs and Hense 2007,

Bentzien and Friederichs 2014, referred to as pinball loss in Wang et al. 2019) over all

targeted quantiles and forecast horizons, using linear programming to estimate optimal

weights for all individual probabilistic forecasts. The method is capable of combining

probabilistic forecasts, independent of whether their predictive PDFs exhibit closed forms (e.g. as in Tyralis and Koutsoyiannis 2014). Note that QS has been consistently

used in hydrological post-processing to characterize the distribution properties of

predictand variables (Bogner et al. 2016, 2017), as well as the quality (reliability,

sharpness, level of calibration etc.) of the predictions.

The aim of this study is to propose a novel method to improve probabilistic

predictions provided by single quantile regression algorithms, by combining

probabilistic hydrological forecasts in the absence of explicit expressions for the PDFs of

the base-learners. We are interested in obtaining central prediction intervals; therefore,

the method is based on the minimization of the interval score (IS, also referred to as

Winkler score, Gneiting and Raftery 2007) and combines base-learners using stacked

generalization (stacking, Wolpert 1992), following the CQRA method. Stacking focuses

on the performance of the combination of the algorithms, in contrast to Bayesian Model Averaging (widely used in hydrology), which may produce largely inaccurate results,

as proved by Yao et al. (2018). Furthermore, it has been suggested that combining

quantile forecasts (as e.g. in the CQRA method) should be preferred to

combining distribution forecasts, e.g. in the context of simple averaging (Lichtendahl Jr

et al. 2013).

We introduce the method with the aim to improve probabilistic predictions when

post-processing the outputs of hydrological models. We assess the proposed

methodology by applying it to 511 basins in the contiguous US (CONUS), using

temperature, precipitation and streamflow data sourced from the CAMELS (Catchment

Attributes and MEteorology for Large-sample Studies) dataset. Two experiments are

conducted in the 511 basins, i.e. (a) one-step-ahead prediction (see e.g. Evin et al. 2014)

and (b) post-processing of hydrological model simulations. The assessment is of large

scale (see e.g. the review in Beck et al. 2017) and, therefore, it can effectively serve for validation of the introduced method. Large-scale assessments are increasingly used in

hydrological modelling and forecasting (see e.g. Perrin et al. 2001, Mouelhi et al. 2006a,

b, Bourgin et al. 2015, Langousis et al. 2016, Beck et al. 2017, Tyralis and

Papacharalampous 2017, 2018, Bock et al. 2018, Papacharalampous et al. 2018a, b, c,

2019a, c, Tyralis et al. 2018a, b, Xu et al. 2018), as their results are more general than

those of case studies, while only few large scale studies currently appear in the literature

of hydrological post-processing (see e.g. Pagano et al. 2013).

In Sections 2 and 3.3, we introduce the proposed general framework and its technical

aspects. In Section 4, we apply the suggested approach within the context of

hydrological post-processing for 511 basins (as outlined above), and illustrate its

improved performance relative to the base-learners used. Sections 4.1 and 5 discuss the

obtained results, as well as general concepts regarding the application of the method.

2. Methods

The definitions and nomenclature for the variables, sets, and methods used hereafter,

are detailed in Appendix A. Appendix B outlines the software packages used to

implement the presented methods and illustrations.

2.1 General introduction

Stacked generalization is a type of ensemble learning (Alpaydin 2014, pp. 487–515)

introduced by Wolpert (1992), where the base-learners are combined using another

learner, usually referred to as the combiner learner (see e.g. Alpaydin 2014, p. 504). A

note to be made here is that ensemble learning of ML algorithms should not be confused

with the general concept of ensemble forecasting in hydrology, which implies that the

estimation variance of hydrological quantities can be obtained from the spread of the

ensemble member forecasts originating from different hydrological models (see e.g. Gneiting et al. 2005). In the context of probabilistic forecasts, ensemble learning stands

for the use of multiple ML algorithms to obtain individual probabilistic forecasts, and

their subsequent combination (through a combiner learner) to obtain prediction

intervals. For example, the CQRA method (Wang et al. 2019) relies on weighted

averaging of predictive quantiles of the base learners.

The base-learners used herein are quantile regression (QR) and quantile regression

forests (QRF, Meinshausen 2006); see Section 2.5 for details. QRF is based on random

forests (RF, Breiman 2001), and it has been used for hydrometeorological post-processing by Taillardat et al. (2016), as well as in other hydrological applications (see

e.g. Bhuiyan et al. 2018). Here QRF is introduced in the context of hydrological post-processing. As the combiner learner, we use the weighted sum of the predictive quantiles,

with estimated weights that minimize the IS.

We produce probabilistic streamflow predictions at daily timescale by post-processing streamflow simulations. The latter are obtained via the GR4J (Génie Rural à 4

paramètres Journalier) lumped conceptual hydrological model introduced by Perrin et

al. (2003). Other hydrological models can be also used; however, our focus here is on the

post-processing procedure, by considering two experiments:

(a) Experiment 1: One-step ahead predictions (as e.g. in Evin et al. 2014), where at

each time step of the prediction period, the base-learners use observed streamflow

information from the previous day, and the same-day hydrological model output.

(b) Experiment 2: Post-processing of hydrological model simulations, where at each

time step, the base-learners use hydrological model outputs for the current and two

previous days.
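As a concrete sketch of these two predictor configurations (hypothetical toy series; the variable names are ours, not from the study), the design matrices can be assembled as follows:

```python
import numpy as np

# Toy series standing in for GR4J simulations (sim) and observations (obs).
sim = np.arange(10.0)
obs = sim + 0.1

# Experiment 1: previous-day observed streamflow and same-day model output.
X1 = np.column_stack([obs[:-1], sim[1:]])
y1 = obs[1:]

# Experiment 2: model outputs for the current and the two previous days;
# the first two days are dropped for lack of lagged values.
X2 = np.column_stack([sim[2:], sim[1:-1], sim[:-2]])
y2 = obs[2:]
```

Each row of X1 or X2 then serves as the predictor vector xt of the corresponding base-learner.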

The proposed framework can also be applied by selecting different predictor variables for the base-learners (as e.g. Ye et al. 2014), and/or used to obtain

probabilistic predictions at multiple steps ahead.

We run the calibrated hydrological model in simulation mode; i.e. we obtain the

streamflow simulations by using recorded temperature and precipitation as input data

(Klemeš 1986; see e.g. Vrugt and Robinson 2007, Montanari and Grossi 2008, Zhao et al.

2011, Evin et al. 2014, Ye et al. 2014, Dogulu et al. 2015). In this way, we assess the

performance of ML algorithms in post-processing hydrological model outputs, avoiding

possible influences imposed by the accuracy of weather forecasts. For the proposed

methodology to be used for forecasting purposes, one needs to run the hydrological

model in forecast mode; i.e. to use temperature and precipitation forecasts, instead of

recorded quantities (Klemeš 1986). In this case, the PDFs of the predictand variables are

expected to be wider (Hemri 2018), reflecting an increase in the uncertainty of the

predictions. The latter is imposed by the intrinsically uncertain character of the weather

forecasts. An alternative way to use post-processing approaches in forecasting is to train

the post-processor assuming no uncertainty in the inputs, and then combine input

uncertainty and post-processing (see e.g. Krzysztofowicz 1999, Pagano et al. 2013).

2.2 The ensemble learner

In this Section, we present the general framework of the proposed methodology. Brief

descriptions of its specific components are given in Section 2.5. We define the interval

score of base-learner n at time t for a prediction interval 1 – a, 0 < a < 1, as (Gneiting and

Raftery 2007):

Ln,t,a(yn,t,a/2, yn,t,(1 – a/2), yt) := (yn,t,(1 – a/2) – yn,t,a/2) + (2/a) (yn,t,a/2 – yt) 1(yt < yn,t,a/2) + (2/a) (yt – yn,t,(1 – a/2)) 1(yt > yn,t,(1 – a/2))    (1)
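For illustration, eq. (1) can be coded directly (a sketch in Python with numpy; the function name is ours):

```python
import numpy as np

def interval_score(lower, upper, y, a):
    """Interval score of eq. (1) for a central 1 - a prediction interval.

    lower, upper: predictive quantiles at levels a/2 and 1 - a/2;
    y: observation(s). Scalars or numpy arrays of a common shape.
    """
    width = upper - lower                             # sharpness term
    below = (2.0 / a) * (lower - y) * (y < lower)     # miss below the interval
    above = (2.0 / a) * (y - upper) * (y > upper)     # miss above the interval
    return width + below + above
```

For a = 0.1 (a 90% interval), an observation inside the interval contributes only the interval width, while a miss is penalised by 20 times its distance from the violated bound.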

IS is a proper scoring rule to assess the properties of prediction intervals (see e.g. Gneiting and Raftery 2007), which traces back to Dunsmore (1968) and Winkler (1972)

(see e.g. Gneiting and Raftery 2007) and has been used to assess the quality of

hydrometeorological forecasts (see e.g. Hamill and Wilks 1995) and hydrological

predictions (see e.g. Bock et al. 2018, Papacharalampous et al. 2019b, c). The reliability

score, which is related to IS, has been used to assess the performance of algorithms for

hydrological post-processing (see e.g. Ye et al. 2014).

Also, let t ∈ {1, …, n1 + n2 + n3}, where the period T with available observations has

been divided into three consecutive subperiods T1, T2, and T3 containing n1, n2 and n3

values, respectively. The stacked algorithm is trained in the period {T1, T2}, whereas

period T3 (i.e. an independent period with data not used for training) is used to test the

stacked algorithm. In what follows, we outline the algorithmic steps used to combine the

probabilistic predictions for a specific prediction interval 1 – a (see also Figure 1 for an

illustration):

Step 1 (Train the base-learners in subperiod T1): Each of the n base-learners fn,q(∙), q ∈ {a/2, 1 – a/2}, is trained independently in subperiod T1, using xt as predictor variables and yt as dependent variables, where t ∈ T1.

Step 2 (Use the base learners to obtain predictions in subperiod T2): The trained base-learners of step 1 are used to predict yn,t,q ∀ t ∈ T2, n ∈ N, q ∈ {a/2, 1 – a/2}, where xt are used as predictor variables of the trained base-learners; i.e. yn,t,q = fn,q(xt).

Step 3 (Stacked generalization): The quantity ∑t Lt,a(yt,a/2, yt,(1 – a/2), yt) is minimized in subperiod T2, where yt,q = ∑n wn,a yn,t,q, q ∈ {a/2, 1 – a/2}, subject to the constraints ∑n wn,a = 1 and wn,a ∈ [0, 1], n ∈ N. The aim is to obtain proper weights wn,a that minimize the total loss over different times t; i.e. ∑t Lt,a(yt,a/2, yt,(1 – a/2), yt).

Step 4 (Retrain the base-learners using the whole training period {T1, T2}): Each of the n base-learners fn,q(∙), q ∈ {a/2, 1 – a/2}, is trained independently again in the period {T1, T2}, using xt as predictor variables and yt as dependent variables, t ∈ {T1, T2}.

Step 5 (Obtain predictions in test period T3): The predictive quantile yt,q, q ∈ {a/2, 1 – a/2}, at time t ∈ T3 for a given predictor variable xt, is calculated as yt,q = fe,q(xt), where fe,q denotes the weighted sum (with weights estimated in Step 3) of the quantiles obtained from the base-learners trained in period {T1, T2}.

Figure 1. Illustration of the steps of the proposed algorithm. Green horizontal lines refer
to the periods of the data used.
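As a toy illustration of Steps 1–5 (hypothetical data and deliberately simple "learners"; with two base-learners, the constraint that the weights sum to 1 leaves a single free weight, so a one-dimensional scan can stand in for the linear-programming search used in practice):

```python
import numpy as np

rng = np.random.default_rng(1)
a = 0.1
y_T1, y_T2 = rng.gamma(2.0, 1.0, 500), rng.gamma(2.0, 1.0, 300)

# Step 1: "train" two toy base-learners on T1: unconditional empirical
# quantiles (learner 1) and deliberately wide min/max intervals (learner 2).
l1, u1 = np.quantile(y_T1, a / 2), np.quantile(y_T1, 1 - a / 2)
l2, u2 = y_T1.min(), y_T1.max()

# Interval score of eq. (1), averaged over subperiod T2.
def mean_is(l, u, y):
    return np.mean((u - l) + (2 / a) * np.maximum(l - y, 0)
                   + (2 / a) * np.maximum(y - u, 0))

# Steps 2-3: predict on T2 and pick the weight minimising the score.
grid = np.linspace(0.0, 1.0, 101)
scores = [mean_is(w * l1 + (1 - w) * l2, w * u1 + (1 - w) * u2, y_T2)
          for w in grid]
w1 = grid[int(np.argmin(scores))]   # weight of learner 1; learner 2 gets 1 - w1

# Steps 4-5 would retrain both learners on {T1, T2} and apply the weighted
# quantiles w1 * q_learner1 + (1 - w1) * q_learner2 on the test period T3.
```

The scan makes the bias–variance trade-off of the combiner visible: wide intervals avoid the miss penalty but inflate the width term, and the optimal weight balances the two.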

To calculate the weights of the ensemble learner, we employ two alternative

approaches. In the first approach (termed as ensemble learning method 1), for each

value of a, steps 1–5 are applied leading to weight combinations for the base learners

that differ for each prediction interval 1 – a. In the second approach (termed as

ensemble learning method 2), step 3 is modified to minimize the quantity ∑a ∑t Lt,a(yt,a/2, yt,(1 – a/2), yt) (i.e. the total loss over several prediction intervals and times), instead of ∑t Lt,a(yt,a/2, yt,(1 – a/2), yt). Hence, in the second approach, the obtained weight combinations for the base-learners are invariant with respect to the prediction interval 1 – a, i.e. wn are estimated (instead of wn,a).

2.3 Comparison to other methods

The ensemble learners 1 and 2 are compared to:

(a) The simple averaging approach, which assigns equal weights (i.e. in our case ½ for

two base learners) to all quantile forecasts. Simple averaging is an important

benchmark, as it corresponds to “an equally weighted opinion pool that is hard to beat in

practice” (see e.g. Lichtendahl Jr. et al. 2013). Simple averaging has been exploited in

Papacharalampous et al. (2019b, c) in a different context; i.e. by averaging multiple

quantile predictions (on the order of hundreds) obtained using simulations from a single

hydrological model. In this case, simple averaging was selected as an alternative to

weight optimization, which may result in prohibitive computational requirements, due

to the large number of applied weights.

(b) Ensemble learners 3 and 4, which correspond to ensemble learners 1 and 2 (see

previous Section) respectively, with the difference that Step 4 of the algorithm (i.e.

retraining of the base learners in period {T1, T2}) is omitted. Thus, prediction is made

using the trained base learners of Step 1. This comparison allows quantification of the

information gain when retraining the base learners in a longer period (i.e. {T1, T2}).

(c) The QR and QRF base learners used to form the ensemble learners.

2.4 Specific remarks on the proposed algorithm

2.4.1 Fundamental concepts

The proposed algorithm borrows concepts from the fields of hydrology, machine

learning, and statistics. The first basic concept, originating from the field of statistics, is

use of the interval score (IS) defined in eq. (1). Use of IS is substantiated by theoretical

arguments (see e.g. Gneiting and Raftery 2007), with lower values indicating better

performance of the base-learners. IS penalizes wider prediction intervals through

component (yn,t,(1 – a/2) – yn,t,a/2) in eq. (1), as well as intervals that do not contain

observations (i.e. through the component of eq. (1) that remains, after subtraction of the interval width). The latter penalty (hereafter referred to simply as penalty) increases

with the distance of the observations outside the prediction interval and, although more

general, it is implicitly linked to the reliability score (RS), which is defined here for base-learner n as:

RSn,a := ∑t 1(yt ∈ [yn,t,a/2, yn,t,(1 – a/2)])/|T|    (2)

An optimal RS should have value equal to 1 – a; i.e. 1 – a of the observed values should

fall inside the 1 – a prediction interval. Ranking of methods can be conducted by

averaging the implemented scores over a fixed set of forecasts (see e.g. Gneiting and

Raftery 2007), with better performing methods exhibiting lower scores.
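The empirical coverage of eq. (2) is straightforward to compute (a sketch; the function name is ours):

```python
import numpy as np

def reliability_score(lower, upper, y):
    """Eq. (2): fraction of observations inside the prediction interval."""
    lower, upper, y = map(np.asarray, (lower, upper, y))
    return np.mean((y >= lower) & (y <= upper))
```

For a well-calibrated 90% central interval (a = 0.1), the returned value should be close to 0.9.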

A second concept, borrowed from the field of machine learning, is implementation of

stacked generalization (see Step 3 above). Stacked generalization (or stacking) is a type

of ensemble learning introduced by Wolpert (1992) (see e.g. Alpaydin 2014, p. 504 for a

comprehensive description of the algorithm), where a combiner learner is used to

“improve” the predictions of the base learners (see e.g. Breiman 1996a, Smyth and

Wolpert 1999), with the latter being used as input. Under this setting, the base learners

and the combiner learner need to be trained over different sets. Here, this is achieved by

splitting the training period into two subperiods T1 and T2. Simultaneous fitting of the

ML algorithms (i.e. Step 1 above) and estimation of the weights (i.e. Steps 2 and 3 above)

using the whole {T1, T2} period is generally not recommended, as ML algorithms tend to

overfit, leading to superior algorithmic performances in the training set relative to

independent test sets. This has been verified also in the context of the present study,

where we found that QRF completely dominated QR.

Other ensemble learning methods also exist; see e.g. the review by Sagi and Rokach

(2018). Two of the most widely used are bagging (Breiman 1996b) and boosting (Friedman 2001, see also the reviews by Natekin and Knoll 2013, and Mayr et al. 2014).

Bagging averages multiple weak learners (i.e. learners with low performance, or

unstable learners), while in boosting new weak base learners are progressively

introduced and trained to minimize the error of the ensemble learner following an

iterative procedure. Thus, new models are progressively added to the ensemble. Instead,

stacking (which is a meta-learning method) uses diverse base-learners to gain in

performance (Sagi and Rokach 2018).

The overall performance of the ensemble learner (formed by the combiner learner

and the base learners) depends on the efficiency of the combiner learner to properly

weigh the base learners within a given test set, which depends on the effectiveness of its

calibration in the training period T2, as well as potential similarities between periods T1

and T2. The splitting problem of a set into training and validation periods is common to

all areas of hydrology and machine learning, and addressing it goes beyond the scope of

the present study. Here, the training set is partitioned into two subperiods (i.e. T1 and

T2) of almost equal lengths (i.e. 8 and 6 years, respectively), whereas the test set (i.e.

subperiod T3) includes 30% of the available data (i.e. 6 out of 20 years); see Section 3.2

for details. Similar relative lengths for the corresponding training and test periods have

been used in other ML studies; see e.g. Antal et al. 2003, Yu and Xu 2008, and Papacharalampous et al. 2019a. The overall results can be considered reliable as the

length of the available data allows examination of various patterns of low and high

flows, as well as other statistical attributes of the data.

2.4.2 Differences from existing frameworks

The proposed algorithm borrows concepts from Wang et al. (2019) and Trapero et al.

(2019), who used QS as a loss function, and Yao et al. (2018), who combined closed expressions of probabilistic forecasts. In the former two studies, the weights of the base-learners were estimated by minimizing the QS across all targeted quantiles and forecast

horizons. Here, we are interested in estimating optimal prediction intervals (i.e. pairs of

quantiles in the form of prediction ranges) and, therefore, minimization of IS is more

suitable than minimization of QS. The latter would lead to doubling the number of the

applied weights (i.e. one weight per bound in QS, vs. one weight per interval in IS), thus

increasing the uncertainty of the resulting predictions. We also note that existing

applications of QS and IS concepts are limited to fields outside hydrology. Additional

advantages when using IS minimization to combine probabilistic forecasts, relative to

other methods (e.g., BMA, BLP, NGR; see Introduction), are that: a) the weight search can

be formulated as a linear programming problem, with considerable increase of accuracy

and computational efficiency of the algorithm, and b) quantile crossing issues are

minimized, as the obtained weights do not depend on posterior distributions that may

present multi-modal features (for an extensive discussion on the merits of stacked

generalization relative to BMA, the reader is referred to Wang et al. 2019). Further

advantages of the method are inherited from the properties of stacked generalization, which is

a general methodology with deep theoretical background (for details see Wolpert 1992),

and the fact that the method is simple, straightforward to use, computationally efficient

(i.e. it takes approximately 45 min to process 511 basins with 30 years of data each

including hydrological model simulations, on a regular PC), and practical due to its full

automation (Trapero et al. 2019).
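The linear-programming formulation mentioned above can be sketched as follows (our own illustrative encoding with scipy, not the authors' code): the indicator penalties of eq. (1) are linearised with slack variables s_t ≥ l_t − y_t and r_t ≥ y_t − u_t, where l_t and u_t are the weighted lower and upper quantiles.

```python
import numpy as np
from scipy.optimize import linprog

def stack_weights_lp(lower, upper, y, a):
    """Estimate combination weights by minimising the total interval score.

    lower, upper: arrays of shape (N, T) with the a/2 and 1 - a/2 predictive
    quantiles of N base-learners at T time steps; y: observations, shape (T,).
    Slack variables s_t >= l_t - y_t and r_t >= y_t - u_t linearise the
    indicator penalties of eq. (1).
    """
    N, T = lower.shape
    # Decision vector: [w_1..w_N, s_1..s_T, r_1..r_T].
    c = np.concatenate([(upper - lower).sum(axis=1),  # interval-width term
                        np.full(T, 2.0 / a),          # lower-miss penalties
                        np.full(T, 2.0 / a)])         # upper-miss penalties
    # s_t >= sum_n w_n * lower[n, t] - y_t  <=>  L'w - s <= y
    A1 = np.hstack([lower.T, -np.eye(T), np.zeros((T, T))])
    # r_t >= y_t - sum_n w_n * upper[n, t]  <=>  -U'w - r <= -y
    A2 = np.hstack([-upper.T, np.zeros((T, T)), -np.eye(T)])
    A_eq = np.hstack([np.ones((1, N)), np.zeros((1, 2 * T))])  # weights sum to 1
    res = linprog(c, A_ub=np.vstack([A1, A2]), b_ub=np.concatenate([y, -y]),
                  A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, 1)] * N + [(0, None)] * (2 * T))
    return res.x[:N]
```

Because the objective and all constraints are linear in the weights and slacks, the solver returns the global optimum, avoiding the local minima that plague non-convex weight searches.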

2.5 Base-learners

General guidelines for the selection of base-learners are presented in Alpaydin (2014,

pp. 488–491). In brief, the base-learners should be simple, accurate, and diverse, so they complement each other. Here we use QR and QRF as base-learners, but the method can

combine more than two quantile regression base-learners. The ensemble learner can

also include different base-learners, which originate from the same ML algorithm (e.g.

QRF), implemented with different parameters or predictor variables. Two reviews on

quantile regression algorithms detailing recent progress in the field can be found in

Koenker (2017) and Waldmann (2018). In general, quantile regression algorithms

model the conditional quantiles of dependent variables as functions of predictor

variables. While a detailed presentation of the implemented ML algorithms goes beyond

the scope of the present study, brief descriptions of the methods and software packages

used for their implementation are presented below.

2.5.1 Quantile regression

Linear-in-parameters quantile regression (QR) was introduced by Koenker and Bassett

(1978), while an extended treatment of the method can be found in Koenker (2005). The

method uses similar techniques to linear regression, to estimate the quantiles of a

dependent variable, conditional on predictor variables. Its main difference relative to

linear regression, is that minimization is conducted in terms of conditional quantiles,

whereas linear regression considers the conditional mean of the response variable. An

intuitive explanation of QR is that it fits a linear model that splits the data so that

100 q% lie below the predicted values of the fitted model. Practically, this is done by

fitting a linear model to the data and minimizing the average QS. The method is suitable

for modelling heteroscedasticity (Koenker 2005, p. 25). We apply the method using the

rq R function of the quantreg R package (Koenker 2018), which implements the fitting

algorithm proposed by Koenker and d'Orey (1987, 1994).
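The mechanics of QS (pinball-loss) minimisation can be illustrated in the simplest possible setting (our own sketch, not the quantreg implementation): for an intercept-only "model", minimising the average QS over candidate constants recovers the empirical quantile.

```python
import numpy as np

def mean_qs(y, pred, q):
    """Average quantile score (pinball loss) at level q."""
    r = y - pred
    return np.mean(np.maximum(q * r, (q - 1) * r))

rng = np.random.default_rng(0)
y = rng.gamma(2.0, 1.0, 2000)   # toy skewed "streamflow" sample
q = 0.9

# Scan candidate constant predictions; the minimiser approximates the
# empirical 0.9-quantile of y, i.e. about 90% of the data lie below it.
grid = np.linspace(y.min(), y.max(), 2001)
best = grid[int(np.argmin([mean_qs(y, c, q) for c in grid]))]
```

QR replaces the constant with a linear function of the predictors and minimises the same loss over its coefficients.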

2.5.2 Quantile regression forests

Quantile regression forests (QRF) were introduced by Meinshausen (2006). The

algorithm is based on random forests (RF, Breiman 2001, see also Biau and Scornet

2016), with interest being on conditional quantiles, rather than the conditional mean. RF is a

very accurate algorithm, as proved by its performance in practical problems and

competitions. Examples include successful use of RF in hydrology (see e.g. Nikolopoulos

et al. 2018, Papacharalampous and Tyralis 2018), point time series forecasting in

hydrometeorological applications (see e.g. Tyralis and Papacharalampous 2017,

Papacharalampous et al. 2018a, 2019a) and spatial interpolation of hydrological

quantities (see e.g. Tyralis et al. 2018, 2019b). An extensive review on the use of RF in

water sciences can be found in Tyralis et al. (2019a), and a detailed description of the

algorithm can be found in Hastie et al. (2009, pp. 587–604).

In a regression setting, random forests average an ensemble of decision trees. The

ensemble is created by bagging (abbreviation for bootstrap aggregation; Breiman

1996b) regression trees. In addition to bagging, the splitting at the nodes of the

regression tree is conducted by randomly selecting a fixed number of predictor

variables, thus inducing an additional degree of randomization, which increases

accuracy of the algorithm.

Similar to random forests, which approximate conditional means, quantile regression

forests approximate conditional quantiles. This is done by averaging the indicator

functions of the events that the decision-tree outcomes for a test point fall below a predefined quantile level.
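Schematically (toy weights, not the grf implementation), the weighted empirical CDF is inverted as:

```python
import numpy as np

def qrf_quantile(y_train, weights, q):
    """Invert a weighted empirical CDF, as in quantile regression forests.

    weights: forest-derived weights of the training responses for one test
    point (non-negative, summing to 1, from leaf co-membership of the trees).
    Returns inf{y : F_hat(y | x) >= q}.
    """
    order = np.argsort(y_train)
    y_sorted = np.asarray(y_train, dtype=float)[order]
    cdf = np.cumsum(np.asarray(weights, dtype=float)[order])
    return y_sorted[np.searchsorted(cdf, q)]
```

In a real forest the weights would differ across test points, which is what makes the estimated quantiles conditional on the predictors.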

Here, we apply QRF using the quantile_forest R function of the grf R package

(Tibshirani et al. 2018), which emulates Meinshausen’s (2006) algorithm (see also Athey et al. 2019). The corresponding algorithm is straightforward and very simple to use,

with a few parameters to tune, while the default values in the software implementation

are near optimal (see e.g. the discussion in Verikas et al. 2011, Oshiro et al. 2012,

Scornet et al. 2015, Biau and Scornet 2016, Probst and Boulesteix 2018, Tyralis et al.

2019a). Therefore, optimization of the algorithm is omitted, considering also that interest lies in the relative improvement of the combiner learner with respect to the base-learners used. Other properties of random forests are that they demonstrate high

predictive performance, they are non-linear and non-parametric, they are fast compared

to other machine learning algorithms, and they are stable and robust to the inclusion of

noisy predictor variables, while they do not extrapolate outside the training range

within the test set (see e.g. Biau and Scornet 2016, Tyralis et al. 2019a).

3. Data and models

3.1 Data

A detailed presentation of the CAMELS dataset, used in the present study, can be found in

Addor et al. (2017a, b), Newman et al. (2014, 2015, 2017) and Thornton et al. (2014).

The dataset comprises daily hydrometeorological and streamflow data from 671

small- to medium-sized basins in CONUS. For each basin, the daily minimum and

maximum temperatures and precipitation have been obtained by processing the daily

dataset of Thornton et al. (2014). Changes in the basins due to human influences are

minimal; therefore, the use of ML algorithms for uncertainty characterization is an

acceptable option; see e.g. Solomatine and Wagener (2011) regarding the requirements

for statistical similarity between subperiods when applying ML methods, and

Koutsoyiannis and Montanari (2015) regarding the appropriateness of the assumption

of stationarity when changes cannot be explained deductively. Here we focus on the 34-year period 1980-2013, and exclude basins with missing data or other inconsistencies.

The final sample consists of 511 basins representing most climate types over CONUS;

see Figure 2.

Figure 2. The 511 basins over CONUS used in the study.

For each of the 511 basins, we estimate the mean daily temperature as the average of

the respective minimum and maximum daily temperatures. The daily potential

evapotranspiration (PET) is estimated by implementing Oudin’s formula (Oudin et al.

2005). For the latter, we use the PEdaily_Oudin R function of the airGR R package

(for details see Coron et al. 2017, 2018), with the daily mean temperature as input.
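For reference, Oudin's formula itself is simple; a minimal Python sketch (assuming the extraterrestrial radiation re, in MJ m-2 day-1, is available, e.g. computed from latitude and day of year) is:

```python
def oudin_pet(re, t_mean):
    """Daily PET (mm/day) after Oudin et al. (2005):
    PET = re / (lambda * rho) * (T + 5) / 100 when T + 5 > 0, else 0.
    Dividing re by 2.45 (the latent heat of vaporization in MJ/kg, combined
    with the density of water) converts MJ m-2 day-1 to mm/day of water."""
    if t_mean + 5.0 <= 0.0:
        return 0.0
    return re / 2.45 * (t_mean + 5.0) / 100.0
```

The PEdaily_Oudin function of airGR wraps this computation, deriving the radiation term internally from latitude and the day of the year.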

3.2 Hydrological model

The GR4J model constitutes an improvement of the GR3J (Génie Rural à 3 paramètres

Journalier) model by Edijatno et al. (1999), and comprises four parameters, while its precursor (i.e. GR3J) comprises three parameters (Perrin et al. 2003). The use of this

small number of parameters is fully justified in Perrin et al. (2001). The hydrological

model is herein calibrated in a non-adaptive way; i.e. the calibration is performed once

for each basin and the hydrological model is thereafter applied with fixed parameter

values (see e.g. Toth et al. 1999). Although feasible, we do not perform adaptive

calibration (see e.g. Brath and Rosso 1993, Ye et al. 2014), as its benefits are delivered

by the base-learners in the context of the hydrological post-processing framework (Toth

et al. 1999).

We use the airGR R package to apply the GR4J hydrological model to each basin. We

simulate daily streamflow with recorded daily precipitation and PET as input. The

period 1980-1981 is used to warm up the hydrological model, while period 1982-1993

is used for model calibration using the Calibration_Michel R function of the

airGR R package. The latter function implements Michel’s (1991) optimization

algorithm using the Nash–Sutcliffe criterion (Nash and Sutcliffe 1970), to characterize

the quality of the hydrological simulations relative to recorded streamflows.

Following the notation presented in Section 2.2, we define the periods: T1 = {1994-01-

01, …, 2001-12-31}, T2 = {2002-01-01, …, 2007-12-31}, T3 = {2008-01-01, …, 2013-12-

31}, and use the calibrated hydrological model to simulate daily streamflows for the

total period T = {T1, T2, T3}. The simulated streamflow vt at time t is calculated using

information until day 1993-12-31 for yt (i.e. the recorded streamflow), and until day t for

prt and pett (i.e. precipitation and potential evapotranspiration, respectively). The final

product consists of 511 simulated streamflow series at a daily resolution in period T,

with a total of 1 120 112 simulated values in period T3, where the ensemble learner is

tested (i.e. 2 192 simulated streamflow values for each of the 511 basins).

3.3 Technical aspects

Post-processing aims at estimating the uncertainty of the predictand variable

conditional on model simulations (see Introduction). In Experiment 1 (i.e. one-step

ahead predictions; see Section 2.1) the predictor variable is defined as xt = (yt – 1, vt). Use

of the last streamflow observation yt-1 as predictor is supported by numerous relevant

examples (see e.g. Krzysztofowicz 1987, Seo et al. 2006, Evin et al. 2014, Bogner et al.

2016), due to the high magnitude of dependence between sequential streamflow

observations. In Experiment 2, the predictor variable is defined as xt = (vt, vt – 1, vt – 2) and

corresponds to the case of post-processing hydrological model simulations. The q-th

quantiles of the predictand can be obtained by post-processing xt through: yt,q = fe,q(xt).
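The alignment of these lagged predictors with the target can be sketched as follows; build_predictors is a hypothetical helper (not part of the study's code) that drops the first two days so that all lags exist:

```python
import numpy as np

def build_predictors(y, v):
    """Return predictor matrices for both experiments, aligned with target y_t.
    Experiment 1: x_t = (y_{t-1}, v_t); Experiment 2: x_t = (v_t, v_{t-1}, v_{t-2})."""
    x_exp1 = np.column_stack([y[1:-1], v[2:]])
    x_exp2 = np.column_stack([v[2:], v[1:-1], v[:-2]])
    return x_exp1, x_exp2, y[2:]
```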

When using ML algorithms, it is common to pre-process the data by applying some

transformation, with the aim to increase the performance of the model. Appropriate

transformations can be applied to both yt and vt. Several options are available in the

existing literature, such as the arcsinh(∙), log(∙), square root, Box-Cox, and Yeo-Johnson

transformations, as well as the normal quantile transformation (see e.g. Krzysztofowicz

1997, Bogner et al. 2012). All aforementioned normalization transformations can be

implemented using the bestNormalize R package (Peterson 2018). The selected

transformation should be applied to both simulated and observed streamflows in the

training sets, and all ML calculations should be performed using transformed quantities.

The inverse transformation is then applied to the predicted quantiles. We tried all

previously mentioned transformations, and found that the square root transformation

was the only one not resulting in unrealistically high quantiles from the QR algorithm in

the examined cases. When compared to conventional statistical methods (Waldmann

2018), QR is more robust and less sensitive to the existence of outliers of the dependent

variable, while RF is invariant to monotonic transformations of the predictor variables

(Díaz-Uriarte and De Andres 2006). The square root transformation has also been used

by Messner (2018) for the purposes of hydrometeorological post-processing.
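The transformation pipeline described above can be sketched as follows, where sqrt_transform_predict is a hypothetical helper and fit_quantile stands in for any quantile learner (QR, QRF, or an ensemble), neither being a function from the paper or its software:

```python
import numpy as np

def sqrt_transform_predict(y_train, v_train, v_test, fit_quantile):
    """Apply the square root transformation to both observed and simulated
    flows, run the quantile learner in transformed space, and back-transform
    the predicted quantiles by squaring (the inverse transformation)."""
    z_quantiles = fit_quantile(np.sqrt(v_train), np.sqrt(y_train), np.sqrt(v_test))
    return np.asarray(z_quantiles) ** 2
```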

Other issues, which need to be addressed in most post-processing applications,

include heteroscedastic behaviour of the data, censoring (i.e. in case the predicted

quantiles exhibit negative values), and quantile crossing problems. Regarding

heteroscedasticity, it can be theoretically addressed by using base-learners that can

effectively model heteroscedastic behaviour, such as the QR used in the present study.

Problems of negative quantiles were minimal in the present application. In the case of

QRF base-learners, negative values are by definition not possible, as the predicted

quantiles constitute subsets of the values found in the training set. For the QR base-

learners, the problem of negative quantiles was addressed by censoring them. Quantile

crossing problems were also minimal in the present application, and have been

addressed by properly adjusting the corresponding quantiles, similar to the approach of

Wang et al. (2019). According to the latter, if the predicted quantile at level q1 turns out to be larger than the predicted quantile at level q2, with q1 < q2, then the latter is set equal to the former.
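Both fixes reduce to two array operations; the sketch below (an illustrative Python helper, not the paper's code) censors negative values at zero and repairs crossings by raising each higher-level quantile to the running maximum:

```python
import numpy as np

def repair_quantiles(pred):
    """pred: predicted quantiles ordered by increasing level q. Negative values
    are censored at zero; a crossing pair is repaired by setting the
    higher-level quantile equal to the lower-level one (cf. Wang et al. 2019)."""
    pred = np.maximum(np.asarray(pred, dtype=float), 0.0)  # censoring
    return np.maximum.accumulate(pred)                      # monotonicity
```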

4. Results and Discussion

For period T3, Figure 3 summarizes information on the simulated and observed

streamflows for all basins analysed. Figure 3.a presents a scatterplot for the same- and

previous-day observed discharges, yt and yt – 1, respectively, while Figures 3.b – d show

scatterplots of yt with respect to the simulated streamflows vt, vt – 1 and vt – 2,

respectively. Regarding Figure 3.b, one sees that the linear regression line (red) between

the same-day hydrological simulations and observations is close to the 45-degree line

(black), indicating that the hydrological model pre-processes the data relatively well.

However, there seems to be a moderate negative bias in the estimation of high flows, as

indicated by the points lying above the 45-degree line. Also, as physically expected, the

deviation between observed and simulated flows increases with increasing lag-times;

see Figures 3.c, d.

Figure 3. Scatterplots of yt versus: (a) yt–1, (b) vt, (c) vt–1, and (d) vt–2 for all basins, and t
in the T3 period. The 45-degree line (black) and the linear regression line (red) between
the variables of the two axes are also presented.

Figure 3.a illustrates the significant positive correlation of observed streamflows in

two sequential days (i.e. yt, and yt – 1), indicating the appropriateness of using xt = (yt – 1,

vt) as a predictor variable in hydrological post-processing. Clearly, the deviation of the

linear regression line in Figure 3.a from the 45-degree line is larger than that in Figure

3.b, indicating the important pre-processing role of the hydrological model. Regarding

high flows, the respective points in Figure 3.a are scattered symmetrically around the

45-degree line. This is statistically justifiable, as the probabilities of observing higher or

lower flows in day t with respect to day t – 1 are approximately equal.

The appropriateness of using xt = (yt – 1, vt) as a predictor variable in hydrological

post-processing is also illustrated in Figure 4, which shows histograms of correlations

(obtained for validation period T3 and all considered basins) between yt and (a) yt–1, (b)

vt, (c) vt–1, and (d) vt–2. One sees that the correlations between yt, and each of the

variables yt–1 and vt are generally higher relative to the correlations between the

observed streamflow yt at time t, and the simulated streamflows at earlier times (i.e. vt–1

and vt–2). Correlation histograms obtained for validation period {T1, T2} are similar to

those in Figure 4 (not shown here for brevity).

Figure 4. Histogram of correlations obtained for validation period T3 between: (a)


current-day yt and previous-day yt-1 observed streamflows, (b) current-day observed yt
and simulated vt streamflows, (c) current-day observed yt and previous-day simulated vt-
1 streamflows (i.e. 1 day lag time), and (d) current-day observed yt and simulated vt-2
streamflows (i.e. 2 days lag time).

Figure 5 shows an example of post-processed hydrological simulations in the context

of experiments 1 and 2 at an arbitrary basin. The 0.025 and 0.975 quantiles of the base-

learners QR and QRF, and ensemble learners 1 and 2 are also presented. Visual

inspection of the post-processed simulations indicates that QR, QRF, and ensemble

learners 1 and 2 produce intervals that, in general, include yt. In experiment 2, the

prediction intervals are wider, due to the larger degree of uncertainty induced by the

absence of the previous-day observed streamflow yt–1 as predictor variable. In the next

Section, we quantitatively assess the performance of each method (including ensemble

learners 3 and 4, as well as simple averaging) using proper metrics.

Figure 5. Illustration of observed streamflows (black solid lines), and predicted


quantiles (dotted lines) for: (a) experiment 1, and (b) experiment 2, for a 1-year sub-
period of T3 at an arbitrary basin. Different post-processing methods are indicated with
different colours.

4.1 Performance assessment

For brevity, and without loss of generality, in what follows we centre the discussion on

experiment 1. The results of experiment 2 are presented through comparison of

performances relative to experiment 1. For all basins analysed, we assess the predictive

performance of ensemble learners 1–4 and the simple averaging method in period T3.

The assessment is made by estimating the relative improvement (RI) introduced with

respect to each of the base-learners. For instance, the relative improvement (RI) of the

interval score of learner i with respect to the nth base-learner (used for benchmarking) is

defined as:

RI := (Ln,a – Li,a)/Ln,a (3)

Similarly, by substituting the interval score by its components, i.e. interval widths, and

penalty (see Section 2.4.1), one can obtain their relative improvements as well; see

Section 4.2 for a detailed analysis and presentation of findings.
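In code, eq. (3) and its component-wise analogues reduce to a one-line function (an illustrative Python sketch):

```python
def relative_improvement(benchmark_score, learner_score):
    """Eq. (3): fraction by which learner i reduces the benchmark's (interval)
    score; positive values indicate improvement over the benchmark."""
    return (benchmark_score - learner_score) / benchmark_score
```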

Regarding experiment 1, Figure 6.a shows the mean RI (over all basins) of ensemble

learners 1–4 and simple averaging with respect to QR, for different prediction intervals

1 – a = 20, 40, 60, 80, 90, 95%. Figure 6.b presents similar results to Figure 6.a, but for

experiment 2.

Figure 6. Mean relative improvement (over all basins) of the interval score (IS) with
respect to QR in: (a) experiment 1, and (b) experiment 2, for different prediction
intervals 1 – a = 20, 40, 60, 80, 90, 95%.

A positive value of RI indicates that the examined learner improves over the

benchmark learner. Values equal to 0 indicate that the examined and benchmark

learners perform identically. Regarding experiment 1, when compared to QR, ensemble

learners 1 and 2 improve more than 10% at prediction intervals below 80%, while the

relative improvement decreases to 5% at higher prediction intervals. When compared to

QRF, the relative improvement is 1-2% at low prediction intervals, and increases to

more than 5% at higher prediction intervals. Lower prediction intervals are

representative of the median values of the streamflow, whereas higher prediction

intervals can be used to predict low and high flows. The diverse properties of the two

base-learners with respect to the magnitude of the prediction interval are also

presented in Figure 6. The performance of the QRF base-learner is 9% better relative to

QR at low prediction intervals, and it decreases at higher prediction intervals. A

probable reason for this is that, by construction, QRF cannot predict beyond the range of

observed flows in the training set, whereas the QR algorithm is regression based

allowing for extrapolation beyond this range.

At all prediction intervals considered, the performances of ensemble learners 3 and 4

are approximately 2% lower compared to ensemble learners 1 and 2, respectively.

Consequently, retraining of the algorithm in period T2 (Step 4) is beneficial and should

be preferred. The relative improvement of ensemble learner 1 over simple averaging is

approximately 2% at prediction intervals below 80%, with the two methods sharing

similarly good performances at higher prediction intervals.

Regarding experiment 2 (see Figure 6.b), one sees that the RI curves are shifted

downwards relative to Figure 6.a, indicating lower overall performances associated with

the larger degree of uncertainty (relative to experiment 1) induced by the absence of the

previous-day observed streamflow as predictor variable. In addition, while ensemble

learners 1 and 2 perform better than the base-learners, simple averaging performs

equally well to both ensemble learners 1 and 2 at all prediction intervals. This important

result indicates that the outcome of optimal weight selection is strongly influenced by

the uncertainty of the predictor-predictand relationship. More precisely, as the level of

uncertainty increases (e.g. experiment 2 relative to experiment 1) weight optimization

may not lead to significant improvements relative to simple averaging; i.e. a uniform

weighting scheme that assigns equal weights to all base-learners.

When averaged over all prediction intervals, the relative improvement of the interval

score of ensemble learner 1 in experiment 1 is 8.84% with respect to QR, and 4.43%

with respect to QRF. The corresponding improvements introduced by ensemble learner

2 are 8.55% with respect to QR and 4.18% with respect to QRF, and by simple averaging

are 7.90% and 3.60%, respectively. The slight improvement of ensemble learner 1 in

experiment 1 should be attributed to its higher flexibility, not compromised by

overfitting. Clearly, the two ensemble learners are able to exploit the diverse properties

of the base-learners and improve uniformly over them, demonstrating improved

performance relative to simple averaging by approximately 1%. Also, it follows from the

discussion above that the first ensemble learner is approximately 0.5% more efficient

relative to the second one. The reason for this is that the first learner uses a combiner

algorithm that allows for additional degrees of freedom, as the weights applied to the

base-learners may vary with the prediction interval 1 – a (see Section 2.2). Note that the

aforementioned increase of predictive performances over the base-learners is

significant, especially due to the size of the test set (i.e. 511 time series, each one

consisting of 34 years of daily streamflow observations). For example, in their study,

Wang et al. (2019) indicate 4.39% average relative improvement of the quantile score

with respect to the three base-learners used, based on eight daily time series of

electricity consumption, each one consisting of four years of data. Although smaller (due

to the larger uncertainty of the predictor-predictand relationship; see above),

improvements of ensemble learners 1 and 2 over the base-learners are also observed in

experiment 2 (see Figure 6.b). In addition, both ensemble learners appear to be overall

equivalent to simple averaging, indicating that weight optimization does not lead to

significant improvements relative to the uniform weighting scheme of simple averaging.

Figure 7 presents histograms of the relative improvements of the IS (see eq. (3))

introduced by the two ensemble learners in experiment 1, for all considered basins,

relative to the two base-learners. Each histogram consists of 3 066 values, which

correspond to six values (i.e. one per prediction interval 1 – a = 20, 40, 60, 80, 90, 95%)

per basin. In all cases, the improvements are mostly positive and well dispersed,

indicating that the results presented in Figure 6 (i.e. the mean relative improvement of

each ensemble learner relative to the two base learners) are not dominated by

exceptional performances of the ensemble learners over a limited set of basins.

Figure 7. Histograms of relative improvements in terms of IS, as computed for all basins
and prediction intervals in experiment 1, for ensemble learner 1 (left panels) and
ensemble learner 2 (right panels). The relative improvements with respect to quantile
regression (QR) are illustrated in the top panels, and with respect to quantile regression
forests (QRF) in the bottom panels.

Figure 8 and Figure 9 present boxplots of the average interval scores (IS) in

experiment 1 and experiment 2, respectively, for period T3. One sees that: a)

independent of the experiment and method used, IS increases with increasing 1 – a; b)

in both experiments, ensemble learners 1-4 and simple averaging improve over the

base-learners, and c) IS values in experiment 1 are generally lower than those in experiment 2,

thus confirming that yt–1 (i.e. used as predictor variable in experiment 1) is more

informative than vt–1 and vt–2 combined (i.e. used as predictor variables in experiment 2).

Figure 8. Notched boxplots of average interval scores for experiment 1 in period T3, for
different prediction intervals 1 – a = (a) 20, (b) 40, (c) 60, (d) 80, (e) 90, (f) 95%. The
lower and upper hinges of the boxes correspond to the first and third quartiles. Values
exceeding the third quartile by more than 1.5 times the interquartile range, are
considered as outliers (denoted by dots).

Figure 9. Notched boxplots of average interval scores for experiment 2 in period T3, for
different prediction intervals 1 – a = (a) 20, (b) 40, (c) 60, (d) 80, (e) 90, (f) 95%. The
lower and upper hinges of the boxes correspond to the first and third quartiles. Values
exceeding the third quartile by more than 1.5 times the interquartile range, are
considered as outliers (denoted by dots).

4.2 Components of the interval score

To gain further insight regarding the performance of each method in the testing period

T3, Figure 10 presents, for both experiments, the ensemble mean (over all basins) of the

absolute differences between the reliability scores (see Section 2.4.1) and the

corresponding nominal prediction intervals.

Figure 10. Ensemble mean (over all basins) of the absolute differences between the
reliability scores and the corresponding nominal values, for prediction intervals 1 – a =
20, 40, 60, 80, 90, 95%, in: (a) experiment 1, and (b) experiment 2.

One can see that, in experiment 1, QR performs better than QRF at prediction

intervals below 60%, whereas the performances are reversed at higher prediction

intervals. The differences are less pronounced in experiment 2, with QR performing

better than QRF at all prediction intervals. In both experiments, ensemble learners 3 and

4 demonstrate limited performance relative to ensemble learners 1 and 2, with the latter

two exhibiting similar performances to simple averaging, balancing those of QR and QRF

base-learners.

Figure 11 presents the median relative improvements (i.e. with respect to QR) of the

performances of base-learner QRF, ensemble learners 1-4, and simple averaging, in

terms of prediction interval widths. Median values are preferred over mean values, to

avoid influences by very low (i.e. near-zero) prediction intervals. While the

performances of all methods are comparable in experiment 2 (see Figure 11.b) due to

the higher level of uncertainty induced by the absence of the previous-day observed

streamflow as predictor variable, in experiment 1 (see Figure 11.a) QR uniformly

dominates QRF and all ensemble methods used.

Figure 11. Median relative improvement (over all basins) of interval widths with respect
to QR in: (a) experiment 1, and (b) experiment 2, for different prediction intervals 1 – a
= 20, 40, 60, 80, 90, 95%.

Figure 12 presents the ensemble mean (over all basins) of the relative improvements

(with respect to QR) of penalties associated with intervals that do not contain

observations (see Section 2.4.1). The general pattern is similar to that of interval scores

in Figure 6, indicating that penalties are an important contributor to the interval score.

Figure 12. Mean relative improvement (over all basins) of penalties with respect to QR
in: (a) experiment 1, and (b) experiment 2, for different prediction intervals 1 – a = 20,
40, 60, 80, 90, 95%.
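The decomposition examined in this section follows the interval score of Gneiting and Raftery (2007); a minimal Python sketch returning the width and penalty components separately (so that IS = width + penalty) is:

```python
import numpy as np

def interval_score_components(y, lower, upper, a):
    """Width and penalty components of the interval score for a central 1 - a
    prediction interval [lower, upper] and observations y."""
    y, lower, upper = (np.asarray(v, dtype=float) for v in (y, lower, upper))
    width = upper - lower
    penalty = (2.0 / a) * ((lower - y) * (y < lower) + (y - upper) * (y > upper))
    return width, penalty
```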

4.3 Weights

To gain insight on how the weights of the ensemble learners (see Section 2.2) are

affected by the performances of the base-learners, Figure 13 shows for ensemble learner

1, scatterplots of the weights assigned to the QR base-learner as a function of the relative

improvement of the average interval score of QRF relative to QR, for different prediction

intervals 1 – a. As expected, one sees that, independent of the experiment (i.e. experiment 1 in Figure 13.a, or experiment 2 in Figure 13.b), the weights assigned to the QR base-learner tend to decrease when the relative improvement of QRF over QR increases.

Figure 13. Scatterplots of the weights of the quantile regression algorithm exploited
through ensemble learner 1 in: (a) experiment 1, and (b) experiment 2, against the
relative improvement of the average interval score of QRF relative to QR in period T3.
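The behaviour in Figure 13 can be mimicked with a toy version of the weight-selection step; the grid search below over a single convex weight is a deliberate simplification of the optimization actually used in the study, and the function name is illustrative:

```python
import numpy as np

def best_qr_weight(y, qr_bounds, qrf_bounds, a, grid=np.linspace(0.0, 1.0, 101)):
    """Weight w on the QR bounds (1 - w on the QRF bounds) minimizing the mean
    interval score of the combined central 1 - a prediction interval.
    qr_bounds and qrf_bounds are (lower, upper) arrays of predicted quantiles."""
    def mean_interval_score(w):
        lo = w * qr_bounds[0] + (1.0 - w) * qrf_bounds[0]
        up = w * qr_bounds[1] + (1.0 - w) * qrf_bounds[1]
        pen = (2.0 / a) * ((lo - y) * (y < lo) + (y - up) * (y > up))
        return float(np.mean((up - lo) + pen))
    return min(grid, key=mean_interval_score)
```

When one base-learner's bounds are both reliable and much sharper than the other's, the selected weight is driven towards that learner, reproducing the negative association seen in the scatterplots.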

For ensemble learner 1 and experiment 1, Figure 14 presents histograms of the

weights assigned to the QR base-learner for varying prediction intervals 1 – a. When the

prediction interval 1 – a increases, the weights increase as well. This should be expected

because the relative gain in performance of QR over QRF, in terms of the interval score, increases at higher prediction intervals. Spikes at the edges of the

histograms correspond to cases where the QR base-learner completely dominates QRF

(or vice versa) resulting in weights equal to 1 (or 0 respectively).

Figure 14. Histograms of the weights of the quantile regression (QR) algorithm exploited
through ensemble learner 1 in experiment 1 for different prediction intervals 1 – a = (a)
20, (b) 40, (c) 60, (d) 80, (e) 90, (f) 95%.

5. Concluding remarks

Ensemble learning of base-learners can result in improved performance of probabilistic

predictions. The few existing methods require formal definition of the likelihoods of the

base-learners, which is too restrictive, as most base-learners cannot provide explicit

expressions for the PDFs of the obtained forecasts. In this study, we borrowed concepts

from Wang et al. (2019) to propose an ensemble learner, which uses stacked

generalization to linearly combine the quantile predictions of non-parametric base-learners (i.e. quantile regression and quantile regression forests algorithms), using

weights that minimize the interval score of the resulting prediction.

The method was tested using a large dataset consisting of 511 basins. The conducted

tests focused on delivering one-step ahead predictions (experiment 1), as well as on post-

processing simulations of a conceptual hydrological model (experiment 2). It was found

that the ensemble learners improve over the performance of the best base-learner by 1-

5%, depending on the experiment and the prediction interval. The suggested method

was also found to outperform simple averaging (i.e. a uniform weighting scheme that

assigns equal weights to all base-learners), or to share first place with it in all

examined cases, with the maximum obtained improvement over this tough benchmark

being approximately equal to 2%.

The results are considered significant, especially given the length of the sample on which the algorithm has been tested (i.e. post-processing of 1 120 112 hydrological predictions

from 511 time series) and the fact that simple averaging is hard to beat in practice (see

e.g. Lichtendahl Jr et al. 2013). The latter general observation indicates that when the

uncertainty of the predictor-predictand relationship increases, the effectiveness of

weight optimization tends to decrease, approaching that of simple averaging; i.e. a

uniform weighting scheme with equal weights assigned to all base-learners.

To the best of our knowledge, no similar study has been conducted in the

hydrological literature, with the closest work being that of Wang et al. (2019) in

electricity forecasting. The latter is based on minimization of the quantile score (QS, see

Introduction), indicating 4.39% average relative improvement with respect to the three

base-learners used, whereas in the present study ensemble learning is conducted by

minimizing the interval score (IS), resulting, e.g. in experiment 1, in approximately 6.5% average relative improvement over the two base-learners (i.e. (8.84% + 4.43% + 8.55% + 4.18%)/4; see Section 4.1). Also, note that application of the constrained quantile

regression averaging (CQRA) method of Wang et al. (2019) was based on eight daily

time series of electricity consumption, each one consisting of four years of data.

One should consider the convenience of using the proposed method over other

combination methods (e.g. Bayesian Model Averaging), as well as theoretical studies that

support: a) stacking against Bayesian Model Averaging, and b) working with quantile

forecasts instead of probability distributions (see also Section 1). The extended use of

machine learning algorithms in hydrological post-processing should also be considered. Machine

learning algorithms are accurate, they have been tested extensively in practice as well as

in forecasting competitions, they are easy to apply due to their open software

implementation, and they are efficiently programmed, resulting in a considerable decrease

of computation times (the computations of the present study, including fitting of the

hydrological model, required approximately 45 min on a regular PC), thus allowing

large-scale implementations.

Based on the aforementioned findings, we recommend use of the proposed ensemble

learners for improving the probabilistic predictions of base-learners. Future research

could focus on defining optimal splitting points of the training set used, inclusion of

more base-learners, testing the method using forecasts of daily temperature and

precipitation as input, as well as assessing the performance of stacked generalization

when metrics/scores, other than IS (see e.g. Gneiting and Raftery 2007 and Shastri et al.

2017), are minimized to optimally combine probabilistic forecasts. Further uses of the

method are also possible, spanning from hydrological forecasting using data-driven

models, to water demand forecasting and other water science problems, and beyond, e.g. in electricity load forecasting.

Conflicts of interest: We declare no conflict of interest.

Acknowledgements: We are grateful to the Editor, Associate Editor, and the reviewers
for their constructive comments and suggestions, which helped us to improve the
manuscript.

Appendix A Nomenclature

Indices

1–a Central prediction interval

n Index of base-learners

t Index of time periods

q Index of quantiles

Sets

N Set of base-learners

T Set of time periods

Q Set of quantiles

Functions

1(∙) The indicator function

|A| Cardinality of a set A

fn,q(∙) The n-th base learner for the q-th quantile

fe,q(∙) The ensemble learner for the q-th quantile

fn(∙) The n-th base learner

Variables

xt Inputs of a model at time t

yt Actual streamflow at time t

prt Observed daily precipitation at time t

pett Observed daily potential evapotranspiration at time t

vt Simulated daily streamflow at time t

yn,t,q The forecasted q-th quantile of the n-th base-learner at time t

yt,q The forecasted q-th quantile of the ensemble learner at time t

Ln,t,a Interval score of the n-th base-learner at time t for the 1 – a prediction interval

Lt,a Interval score of the weighted average of the N methods at time t for the 1 – a
prediction interval

Appendix B Used software

All computations and visualizations were conducted in R Programming Language (R

Core Team 2019) using the following packages: airGR (Coron et al. 2017, 2019),

bestNormalize (Peterson 2018), data.table (Dowle and Srinivasan 2019),

devtools (Wickham et al. 2019c), doParallel (Microsoft Corporation and Weston

2017), dplyr (Wickham et al. 2019b), foreach (Microsoft and Weston 2018), gdata

(Warnes et al. 2017), ggplot2 (Wickham 2016; Wickham et al. 2019a), grf (Tibshirani

et al. 2018), knitr (Xie 2014, 2015, 2019), quantreg (Koenker 2018), readr

(Wickham et al. 2018), reshape2 (Wickham 2007, 2017), rmarkdown (Allaire et al.

2019), stringi (Gagolewski 2019), stringr (Wickham 2019).

References

[1] Addor N, Newman AJ, Mizukami N, Clark MP (2017a) Catchment attributes for
large-sample studies. Boulder, CO: UCAR/NCAR.
https://doi.org/10.5065/D6G73C3Q
[2] Addor N, Newman AJ, Mizukami N, Clark MP (2017b) The CAMELS data set:
Catchment attributes and meteorology for large-sample studies. Hydrology and
Earth System Sciences 21:5293–5313. https://doi.org/10.5194/hess-21-5293-
2017
[3] Allaire JJ, Xie Y, McPherson J, Luraschi J, Ushey K, Atkins A, Wickham H, Cheng J,
Chang W, Iannone R (2019) rmarkdown: Dynamic documents for R. R package
version 1.12. https://CRAN.R-project.org/package=rmarkdown
[4] Alpaydin E (2014) Introduction to Machine Learning, 3rd Edition. The MIT Press,
Cambridge, Massachusetts
[5] Antal P, Fannes G, Timmerman D, Moreau Y, de Moor B (2003) Bayesian
applications of belief networks and multilayer perceptrons for ovarian tumor
classification with rejection. Artificial Intelligence in Medicine 29(1–2):39–60.
https://doi.org/10.1016/S0933-3657(03)00053-8
[6] Athey S, Tibshirani J, Wager S (2019) Generalized random forests. The Annals of
Statistics 47(2):1148–1178. https://doi.org/10.1214/18-AOS1709
[7] Baran S, Lerch S (2018) Combining predictive distributions for the statistical
post-processing of ensemble forecasts. International Journal of Forecasting
34(3):477–496. https://doi.org/10.1016/j.ijforecast.2018.01.005

[8] Beck HE, van Dijk AIJM, de Roo A, Dutra E, Fink G, Orth R, Schellekens J (2017)
Global evaluation of runoff from 10 state-of-the-art hydrological models.
Hydrology and Earth System Sciences 21(6):2881–2903.
https://doi.org/10.5194/hess-21-2881-2017
[9] Bentzien S, Friederichs P (2014) Decomposition and graphical portrayal of the
quantile score. Quarterly Journal of the Royal Meteorological Society
140(683):1924–1934. https://doi.org/10.1002/qj.2284
[10] Bhuiyan MAE, Nikolopoulos EI, Anagnostou EN, Quintana-Seguí P, Barella-Ortiz
A (2018) A nonparametric statistical technique for combining global
precipitation datasets: development and hydrological evaluation over the
Iberian Peninsula. Hydrology and Earth System Sciences 22:1371–1389.
https://doi.org/10.5194/hess-22-1371-2018
[11] Biau G, Scornet E (2016) A random forest guided tour. TEST 25(2):197–227.
https://doi.org/10.1007/s11749-016-0481-7
[12] Bock AR, Farmer WH, Hay LE (2018) Quantifying uncertainty in simulated
streamflow and runoff from a continental-scale monthly water balance model.
Advances in Water Resources 122:166–175.
https://doi.org/10.1016/j.advwatres.2018.10.005
[13] Bogner K, Pappenberger F (2011) Multiscale error analysis, correction, and
predictive uncertainty estimation in a flood forecasting system. Water
Resources Research 47(7):W07524. https://doi.org/10.1029/2010WR009137
[14] Bogner K, Pappenberger F, Cloke HL (2012) Technical Note: The normal
quantile transformation and its application in a flood forecasting system.
Hydrology and Earth System Sciences 16:1085–1094.
https://doi.org/10.5194/hess-16-1085-2012
[15] Bogner K, Liechti K, Zappa M (2016) Post-processing of stream flows in
Switzerland with an emphasis on low flows and floods. Water 8(4):115.
https://doi.org/10.3390/w8040115
[16] Bogner K, Liechti K, Zappa M (2017) Technical note: Combining quantile
forecasts and predictive distributions of streamflows. Hydrology and Earth
System Sciences 21:5493–5502. https://doi.org/10.5194/hess-21-5493-2017
[17] Bourgin F, Andréassian V, Perrin C, Oudin L (2015) Transferring global
uncertainty estimates from gauged to ungauged catchments. Hydrology and
Earth System Sciences 19:2535–2546. https://doi.org/10.5194/hess-19-2535-
2015
[18] Box GEP, Jenkins GM, Reinsel GC, Ljung GM (2015) Time Series Analysis:
Forecasting and Control, 5th Edition. John Wiley & Sons, Inc., Hoboken, New
Jersey
[19] Brath A, Rosso R (1993) Adaptive calibration of a conceptual model for flash
flood forecasting. Water Resources Research 29(8):2561–2572.
https://doi.org/10.1029/93WR00665
[20] Breiman L (1996a) Stacked regressions. Machine Learning 24(1):49–64.
https://doi.org/10.1007/BF00117832
[21] Breiman L (1996b) Bagging predictors. Machine Learning 24(2):123–140.
https://doi.org/10.1007/BF00058655
[22] Breiman L (2001) Random forests. Machine Learning 45(1):5–32.
https://doi.org/10.1023/A:1010933404324
[23] Bzdok D, Altman N, Krzywinski M (2018) Statistics versus machine learning.
Nature Methods 15:233–234. https://doi.org/10.1038/nmeth.4642

[24] Coron L, Thirel G, Delaigue O, Perrin C, Andréassian V (2017) The suite of
lumped GR hydrological models in an R package. Environmental Modelling and
Software 94:166–171. https://doi.org/10.1016/j.envsoft.2017.05.002
[25] Coron L, Delaigue O, Thirel G, Perrin C, Michel C (2019) airGR: Suite of GR
hydrological models for precipitation-runoff modelling. R package version
1.2.13.16. https://CRAN.R-project.org/package=airGR
[26] Das T, Bárdossy A, Zehe E, He Y (2008) Comparison of conceptual model
performance using different representations of spatial variability. Journal of
Hydrology 356(1–2):106–118. https://doi.org/10.1016/j.jhydrol.2008.04.008
[27] Díaz-Uriarte R, De Andres SA (2006) Gene selection and classification of
microarray data using random forest. BMC Bioinformatics 7:3.
https://doi.org/10.1186/1471-2105-7-3
[28] Dogulu N, López López P, Solomatine DP, Weerts AH, Shrestha DL (2015)
Estimation of predictive hydrologic uncertainty using the quantile regression
and UNEEC methods and their comparison on contrasting catchments.
Hydrology and Earth System Sciences 19:3181–3201.
https://doi.org/10.5194/hess-19-3181-2015
[29] Dowle M, Srinivasan A (2019) data.table: Extension of 'data.frame'. R package
version 1.12.2. https://CRAN.R-project.org/package=data.table
[30] Dunsmore IR (1968) A Bayesian approach to calibration. Journal of the Royal
Statistical Society: Series B (Methodological) 30(2):396–405
[31] Edijatno, Nascimento NO, Yang X, Makhlouf Z, Michel C (1999) GR3J: A daily
watershed model with three free parameters. Hydrological Sciences Journal
44(2):263–277. https://doi.org/10.1080/02626669909492221
[32] Evensen G (1994) Sequential data assimilation with a nonlinear
quasi-geostrophic model using Monte Carlo methods to forecast error statistics.
Journal of Geophysical Research 99(C5):10143–10162.
https://doi.org/10.1029/94JC00572
[33] Evin G, Thyer M, Kavetski D, McInerney D, Kuczera G (2014) Comparison of joint
versus postprocessor approaches for hydrological uncertainty estimation
accounting for error autocorrelation and heteroscedasticity. Water Resources
Research 50(3):2350–2375. https://doi.org/10.1002/2013WR014185
[34] Friederichs P, Hense A (2007) Statistical downscaling of extreme precipitation
events using censored quantile regression. Monthly Weather Review 135:2365–
2378. https://doi.org/10.1175/MWR3403.1
[35] Friedman JH (2001) Greedy function approximation: A gradient boosting
machine. The Annals of Statistics 29(5):1189–1232.
https://doi.org/10.1214/aos/1013203451
[36] Gagolewski M (2019) stringi: Character string processing facilities. R package
version 1.4.3. https://CRAN.R-project.org/package=stringi
[37] Gneiting T, Raftery AE (2007) Strictly proper scoring rules, prediction, and
estimation. Journal of the American Statistical Association 102(477):359–378.
https://doi.org/10.1198/016214506000001437
[38] Gneiting T, Ranjan R (2013) Combining predictive distributions. Electronic
Journal of Statistics 7:1747–1782. https://doi.org/10.1214/13-EJS823

[39] Gneiting T, Raftery AE, Westveld AH, Goldman T (2005) Calibrated probabilistic
forecasting using ensemble model output statistics and minimum CRPS
Estimation. Monthly Weather Review 133:1098–1118.
https://doi.org/10.1175/MWR2904.1
[40] Hamill TM, Wilks DS (1995) A Probabilistic forecast contest and the difficulty in
assessing short-range forecast uncertainty. Weather and Forecasting 10:620–
631. https://doi.org/10.1175/1520-0434(1995)010<0620:APFCAT>2.0.CO;2
[41] Hannan EJ, Dunsmuir WTM, Deistler M (1980) Estimation of vector ARMAX
models. Journal of Multivariate Analysis 10(3):275–295.
https://doi.org/10.1016/0047-259X(80)90050-0
[42] Hastie T, Tibshirani R, Friedman J (2009) The Elements of Statistical Learning.
Springer-Verlag New York. https://doi.org/10.1007/978-0-387-84858-7
[43] Hemri S (2018) Chapter 8 - Applications of Postprocessing for Hydrological
Forecasts. In: Vannitsem S, Wilks DS, Messner JW (eds) Statistical
Postprocessing of Ensemble Forecasts. Elsevier, pp 219–240.
https://doi.org/10.1016/B978-0-12-812372-0.00008-X
[44] Hernández-López MR, Francés F (2017) Bayesian joint inference of hydrological
and generalized error models with the enforcement of Total Laws. Hydrology
and Earth System Sciences Discussions. https://doi.org/10.5194/hess-2017-9
[45] Hong T, Pinson P, Fan S, Zareipour H, Troccoli A, Hyndman RJ (2016)
Probabilistic energy forecasting: Global Energy Forecasting Competition 2014
and beyond. International Journal of Forecasting 32(3):896–913.
https://doi.org/10.1016/j.ijforecast.2016.02.001
[46] James G, Witten D, Hastie T, Tibshirani R (2013) An Introduction to Statistical
Learning. Springer-Verlag New York. https://doi.org/10.1007/978-1-4614-
7138-7
[47] Kaleris V, Langousis A (2017) Comparison of two rainfall–runoff models: effects
of conceptualization on water budget components. Hydrological Sciences
Journal 62(5):729–748. https://doi.org/10.1080/02626667.2016.1250899
[48] Kalman RE (1960) A new approach to linear filtering and prediction problems.
Journal of Basic Engineering 82(1):35–45. https://doi.org/10.1115/1.3662552
[49] Kavetski D, Franks SW, Kuczera G (2002) Confronting Input Uncertainty in
Environmental Modelling. In: Duan Q, Gupta HV, Sorooshian S, Rousseau AN,
Turcotte R (eds) Calibration of Watershed Models. AGU, pp 49–68.
https://doi.org/10.1029/WS006p0049
[50] Klemeš V (1986) Operational testing of hydrological simulation models.
Hydrological Sciences Journal 31(1):13–24.
https://doi.org/10.1080/02626668609491024
[51] Koenker RW (2005) Quantile regression. Cambridge University Press,
Cambridge, UK
[52] Koenker RW (2017) Quantile regression: 40 years on. Annual Review of
Economics 9(1):155–176. https://doi.org/10.1146/annurev-economics-
063016-103651
[53] Koenker RW (2018) quantreg: Quantile regression. R package version 5.38.
https://CRAN.R-project.org/package=quantreg
[54] Koenker RW, Bassett Jr G (1978) Regression quantiles. Econometrica 46(1):33–
50. https://doi.org/10.2307/1913643

[55] Koenker RW, D'Orey V (1987) Computing regression quantiles. Journal of the
Royal Statistical Society: Series C (Applied Statistics) 36(3):383–393.
https://doi.org/10.2307/2347802
[56] Koenker RW, D'Orey V (1994) A remark on algorithm AS 229: Computing dual
regression quantiles and regression rank scores. Journal of the Royal Statistical
Society: Series C (Applied Statistics) 43(2):410–414.
https://doi.org/10.2307/2986030
[57] Koenker RW, Machado JAF (1999) Goodness of fit and related inference
processes for quantile regression. Journal of the American Statistical
Association 94(448):1296–1310.
https://doi.org/10.1080/01621459.1999.10473882
[58] Koutsoyiannis D, Montanari A (2015) Negligent killing of scientific concepts: the
stationarity case. Hydrological Sciences Journal 60(7–8):1174–1183.
https://doi.org/10.1080/02626667.2014.959959
[59] Krzysztofowicz R (1987) Markovian forecast processes. Journal of the American
Statistical Association 82(397):31–37.
https://doi.org/10.1080/01621459.1987.10478387
[60] Krzysztofowicz R (1997) Transformation and normalization of variates with
specified distributions. Journal of Hydrology 197(1–4):286–292.
https://doi.org/10.1016/S0022-1694(96)03276-3
[61] Krzysztofowicz R (1999) Bayesian theory of probabilistic forecasting via
deterministic hydrologic model. Water Resources Research 35(9):2739–2750.
https://doi.org/10.1029/1999WR900099
[62] Krzysztofowicz R (2001) The case for probabilistic forecasting in hydrology.
Journal of Hydrology 249(1–4):2–9. https://doi.org/10.1016/S0022-
1694(01)00420-6
[63] Krzysztofowicz R (2002) Bayesian system for probabilistic river stage
forecasting. Journal of Hydrology 268:16–40. https://doi.org/10.1016/S0022-
1694(02)00106-3
[64] Krzysztofowicz R, Kelly KS (2000) Hydrologic uncertainty processor for
probabilistic river stage forecasting. Water Resources Research 36:3265–3277.
https://doi.org/10.1029/2000WR900108
[65] Kuczera G, Kavetski D, Franks S, Thyer M (2006) Towards a Bayesian total error
analysis of conceptual rainfall-runoff models: Characterising model error using
storm-dependent parameters. Journal of Hydrology 331(1–2):161–177.
https://doi.org/10.1016/j.jhydrol.2006.05.010
[66] Langousis A, Mamalakis A, Puliga M, Deida R (2016) Threshold detection for the
generalized Pareto distribution: Review of representative methods and
application to the NOAA NCDC daily rainfall database. Water Resources
Research 52(4):2659–2681. https://doi.org/10.1002/2015WR018502
[67] Li W, Duan Q, Miao C, Ye A, Gong W, Di Z (2017) A review on statistical
postprocessing methods for hydrometeorological ensemble forecasting. Wiley
Interdisciplinary Reviews: Water 4(6):e1246.
https://doi.org/10.1002/wat2.1246
[68] Lichtendahl Jr KC, Grushka-Cockayne Y, Winkler RL (2013) Is it better to
average probabilities or quantiles?. Management Science 59(7):1594–1611.
https://doi.org/10.1287/mnsc.1120.1667

43
[69] Lidén R, Harlin J (2000) Analysis of conceptual rainfall–runoff modelling
performance in different climates. Journal of Hydrology 238(3–4):231–247.
https://doi.org/10.1016/S0022-1694(00)00330-9
[70] López López P, Verkade JS, Weerts AH, Solomatine DP (2014) Alternative
configurations of quantile regression for estimating predictive uncertainty in
water level forecasts for the upper Severn River: a comparison. Hydrology and
Earth System Sciences 18:3411–3428. https://doi.org/10.5194/hess-18-3411-
2014
[71] Mayr A, Binder H, Gefeller O, Schmid M (2014) The evolution of boosting
algorithms. Methods of Information in Medicine 53(06):419–427.
https://doi.org/10.3414/ME13-01-0122
[72] Meinshausen N (2006) Quantile regression forests. Journal of Machine Learning
Research 7:983–999
[73] Messner JW (2018) Chapter 11 - Ensemble Postprocessing With R. In:
Vannitsem S, Wilks DS, Messner JW (eds) Statistical Postprocessing of Ensemble
Forecasts. Elsevier, pp 291–329. https://doi.org/10.1016/B978-0-12-812372-
0.00011-X
[74] Michel C (1991) Hydrologie appliquée aux petits bassins ruraux. Cemagref,
Antony, France
[75] Microsoft, Weston S (2017) foreach: Provides foreach looping construct for R. R
package version 1.4.4. https://CRAN.R-project.org/package=foreach
[76] Microsoft Corporation, Weston S (2018) doParallel: Foreach parallel adaptor for
the 'parallel' package. R package version 1.0.14. https://CRAN.R-
project.org/package=doParallel
[77] Min C, Zellner A (1993) Bayesian and non-Bayesian methods for combining
models and forecasts with applications to forecasting international growth
rates. Journal of Econometrics 56(1–2):89–118. https://doi.org/10.1016/0304-
4076(93)90102-B
[78] Montanari A (2011) 2.17 - Uncertainty of Hydrological Predictions. In: Wilderer
P (ed) Treatise on Water Science. Elsevier, pp 459–478.
https://doi.org/10.1016/B978-0-444-53199-5.00045-2
[79] Montanari A, Brath A (2004) A stochastic approach for assessing the
uncertainty of rainfall-runoff simulations. Water Resources Research
40(1):W01106. https://doi.org/10.1029/2003WR002540
[80] Montanari A, Grossi G (2008) Estimating the uncertainty of hydrological
forecasts: A statistical approach. Water Resources Research 44(12):W00B08.
https://doi.org/10.1029/2008WR006897
[81] Montanari A, Koutsoyiannis D (2012) A blueprint for process-based modeling of
uncertain hydrological systems. Water Resources Research 48(9):W09555.
https://doi.org/10.1029/2011WR011412
[82] Mouelhi S, Michel C, Perrin C, Andréassian V (2006a) Stepwise development of a
two-parameter monthly water balance model. Journal of Hydrology 318(1–
4):200–214. https://doi.org/10.1016/j.jhydrol.2005.06.014
[83] Mouelhi S, Michel C, Perrin C, Andréassian V (2006b) Linking stream flow to
rainfall at the annual time step: the Manabe bucket model revisited. Journal of
Hydrology 328(1–2):283–296. https://doi.org/10.1016/j.jhydrol.2005.12.022
[84] Nash JE, Sutcliffe JV (1970) River flow forecasting through conceptual models
part I — A discussion of principles. Journal of Hydrology 10(3):282–290.
https://doi.org/10.1016/0022-1694(70)90255-6

44
[85] Natekin A, Knoll A (2013) Gradient boosting machines, a tutorial. Frontiers in
Neurorobotics 7:21. https://doi.org/10.3389/fnbot.2013.00021
[86] Newman AJ, Sampson K, Clark MP, Bock A, Viger RJ, Blodgett D (2014) A large-
sample watershed-scale hydrometeorological dataset for the contiguous USA.
Boulder, CO: UCAR/NCAR. https://doi.org/10.5065/D6MW2F4D
[87] Newman AJ, Clark MP, Sampson K, Wood A, Hay LE, Bock A, Viger RJ, Blodgett D,
Brekke L, Arnold JR, Hopson T, Duan Q (2015) Development of a large-sample
watershed-scale hydrometeorological data set for the contiguous USA: data set
characteristics and assessment of regional variability in hydrologic model
performance. Hydrology and Earth System Sciences 19:209–223.
https://doi.org/10.5194/hess-19-209-2015
[88] Newman AJ, Mizukami N, Clark MP, Wood AW, Nijssen B, Nearing G (2017)
Benchmarking of a physically based hydrologic model. Journal of
Hydrometeorology 18:2215–2225. https://doi.org/10.1175/JHM-D-16-0284.1
[89] Nikolopoulos EI, Destro E, Bhuiyan MAE, Borga M, Anagnostou EN (2018)
Evaluation of predictive models for post-fire debris flow occurrence in the
western United States. Natural Hazards and Earth System Sciences 18:2331–
2343. https://doi.org/10.5194/nhess-18-2331-2018
[90] Oshiro TM, Perez PS, Baranauskas JA (2012) How many trees in a random
forest?. In: Perner P (ed) Machine Learning and Data Mining in Pattern
Recognition (Lecture Notes in Computer Science). Springer-Verlag Berlin
Heidelberg, IBaI, Leipzig, Germany, 2012; Volume 7376, pp 154–168.
https://doi.org/10.1007/978-3-642-31537-4
[91] Ouali D, Chebana F, Ouarda TBMJ (2016) Quantile regression in regional
frequency analysis: A better exploitation of the available information. Journal of
Hydrometeorology 17:1869–1883. https://doi.org/10.1175/JHM-D-15-0187.1
[92] Oudin L, Hervieu F, Michel C, Perrin C, Andréassian V, Anctil F, Loumagne C
(2005) Which potential evapotranspiration input for a lumped rainfall–runoff
model?: Part 2—Towards a simple and efficient potential evapotranspiration
model for rainfall–runoff modelling. Journal of Hydrology 303(1–4):290–306.
https://doi.org/10.1016/j.jhydrol.2004.08.026
[93] Pagano TC, Shrestha DL, Wang QJ, Robertson D, Hapuarachchi P (2013)
Ensemble dressing for hydrological applications. Hydrological Processes
27(1):106–116. https://doi.org/10.1002/hyp.9313
[94] Papacharalampous G, Tyralis H (2018) Evaluation of random forests and
Prophet for daily streamflow forecasting. Advances in Geosciences 45:201–208.
https://doi.org/10.5194/adgeo-45-201-2018
[95] Papacharalampous G, Tyralis H, Koutsoyiannis D (2018a) One-step ahead
forecasting of geophysical processes within a purely statistical framework.
Geoscience Letters 5(12). https://doi.org/10.1186/s40562-018-0111-1
[96] Papacharalampous G, Tyralis H, Koutsoyiannis D (2018b) Predictability of
monthly temperature and precipitation using automatic time series forecasting
methods. Acta Geophysica 66(4):807–831. https://doi.org/10.1007/s11600-
018-0120-7
[97] Papacharalampous G, Tyralis H, Koutsoyiannis D (2018c) Univariate time series
forecasting of temperature and precipitation with a focus on machine learning
algorithms: A multiple-case study from Greece. Water Resources Management
32(15):5207–5239. https://doi.org/10.1007/s11269-018-2155-6

[98] Papacharalampous G, Tyralis H, Koutsoyiannis D (2019a) Comparison of
stochastic and machine learning methods for multi-step ahead forecasting of
hydrological processes. Stochastic Environmental Research and Risk
Assessment 33(2):481–514. https://doi.org/10.1007/s00477-018-1638-6
[99] Papacharalampous G, Koutsoyiannis D, Montanari A (2019b) Quantification of
predictive uncertainty in hydrological modelling by harnessing the wisdom of
the crowd: Methodology development and investigation using toy models.
https://doi.org/10.13140/RG.2.2.32868.22401
[100] Papacharalampous G, Tyralis H, Koutsoyiannis D, Montanari A (2019c)
Quantification of predictive uncertainty in hydrological modelling by harnessing
the wisdom of the crowd: A large–sample experiment at monthly timescale.
https://doi.org/10.13140/RG.2.2.16091.00801
[101] Perrin C, Michel C, Andréassian V (2001) Does a large number of parameters
enhance model performance? Comparative assessment of common catchment
model structures on 429 catchments. Journal of Hydrology 242(3–4):275–301.
https://doi.org/10.1016/S0022-1694(00)00393-0
[102] Perrin C, Michel C, Andréassian V (2003) Improvement of a parsimonious model
for streamflow simulation. Journal of Hydrology 279(1–4):275–289.
https://doi.org/10.1016/S0022-1694(03)00225-7
[103] Peterson RA (2018) bestNormalize: Normalizing transformation functions. R
package version 1.3.0. https://CRAN.R-project.org/package=bestNormalize
[104] Probst P, Boulesteix AL (2018) To tune or not to tune the number of trees in
random forest. Journal of Machine Learning Research 18(181):1–18
[105] R Core Team (2019) R: A language and environment for statistical computing. R
Foundation for Statistical Computing, Vienna, Austria. https://www.R-
project.org/
[106] Raftery AE, Madigan D, Hoeting JA (1997) Bayesian model averaging for linear
regression models. Journal of the American Statistical Association 92(437):179–
191. https://doi.org/10.1080/01621459.1997.10473615
[107] Raftery AE, Gneiting T, Balabdaoui F, Polakowski M (2005) Using Bayesian
model averaging to calibrate forecast ensembles. Monthly Weather Review
133:1155–1174. https://doi.org/10.1175/MWR2906.1
[108] Ranjan R, Gneiting T (2010) Combining probability forecasts. Journal of the
Royal Statistical Society: Series B (Statistical Methodology) 72(1):71–91.
https://doi.org/10.1111/j.1467-9868.2009.00726.x
[109] Reinsel G (1979) Maximum likelihood estimation of stochastic linear difference
equations with autoregressive moving average errors. Econometrica
47(1):129–151. https://doi.org/10.2307/1912351
[110] Rigby RA, Stasinopoulos DM (2005) Generalized additive models for location,
scale and shape. Journal of the Royal Statistical Society: Series C (Applied
Statistics) 54(3):507–554. https://doi.org/10.1111/j.1467-9876.2005.00510.x
[111] Sagi O, Rokach L (2018) Ensemble learning: A survey. Wiley Interdisciplinary
Reviews: Data Mining and Knowledge Discovery 8(4):e1249.
https://doi.org/10.1002/widm.1249
[112] Scornet E, Biau G, Vert JP (2015) Consistency of random forests. The Annals of
Statistics 43(4):1716–1741. https://doi.org/10.1214/15-AOS1321

[113] Seo DJ, Herr HD, Schaake JC (2006) A statistical post-processor for accounting of
hydrologic uncertainty in short-range ensemble streamflow prediction.
Hydrology and Earth System Sciences Discussions 3:1987–2035.
https://doi.org/10.5194/hessd-3-1987-2006
[114] Shastri H, Ghosh S, Karmakar S (2017) Improving global forecast system of
extreme precipitation events with regional statistical model: Application of
quantile-based probabilistic forecasts. Journal of Geophysical Research
122(3):1617–1634. https://doi.org/10.1002/2016JD025489
[115] Smyth P, Wolpert D (1999) Linearly combining density estimators via stacking.
Machine Learning 36(1–2):59–83. https://doi.org/10.1023/A:1007511322260
[116] Solomatine DP, Wagener T (2011) 2.16 - Hydrological Modeling. In: Wilderer P
(ed) Treatise on Water Science. Elsevier, pp 435–457.
https://doi.org/10.1016/B978-0-444-53199-5.00044-0
[117] Taillardat M, Mestre O, Zamo M, Naveau P (2016) Calibrated ensemble forecasts
using quantile regression forests and ensemble model output statistics. Monthly
Weather Review 144:2375–2393. https://doi.org/10.1175/MWR-D-15-0260.1
[118] Taylor JW (2000) A quantile regression neural network approach to estimating
the conditional density of multiperiod returns. Journal of Forecasting
19(4):299–311. https://doi.org/10.1002/1099-131X(200007)19:4<299::AID-
FOR775>3.0.CO;2-V
[119] Thornton PE, Thornton MM, Mayer BW, Wilhelmi N, Wei Y, Devarakonda R,
Cook RB (2014) Daymet: Daily surface weather data on a 1-km grid for North
America, version 2. ORNL DAAC, Oak Ridge, Tennessee, USA. Date accessed:
2016/01/20. https://doi.org/10.3334/ORNLDAAC/1219
[120] Tibshirani J, Athey S, Wager S (2018) grf: Generalized random forests (beta). R
package version 0.10.2. https://CRAN.R-project.org/package=grf
[121] Todini E (2007) Hydrological catchment modelling: Past, present and future.
Hydrology and Earth System Sciences 11:468–482.
https://doi.org/10.5194/hess-11-468-2007
[122] Toth E, Montanari A, Brath A (1999) Real-time flood forecasting via combined
use of conceptual and stochastic models. Physics and Chemistry of the Earth,
Part B: Hydrology, Oceans and Atmosphere 24(7):793–798.
https://doi.org/10.1016/S1464-1909(99)00082-9
[123] Trapero JR, Cardós M, Kourentzes N (2019) Quantile forecast optimal
combination to enhance safety stock estimation. International Journal of
Forecasting 35(1):239–250. https://doi.org/10.1016/j.ijforecast.2018.05.009
[124] Tyralis H, Koutsoyiannis D (2014) A Bayesian statistical model for deriving the
predictive distribution of hydroclimatic variables. Climate Dynamics 42(11–
12):2867–2883. https://doi.org/10.1007/s00382-013-1804-y
[125] Tyralis H, Koutsoyiannis D (2017) On the prediction of persistent processes
using the output of deterministic models. Hydrological Sciences Journal
62(13):2083–2102. https://doi.org/10.1080/02626667.2017.1361535
[126] Tyralis H, Papacharalampous G (2017) Variable selection in time series
forecasting using random forests. Algorithms 10(4):114.
https://doi.org/10.3390/a10040114
[127] Tyralis H, Papacharalampous G (2018) Large-scale assessment of Prophet for
multi-step ahead forecasting of monthly streamflow. Advances in Geosciences
45:147–153. https://doi.org/10.5194/adgeo-45-147-2018

[128] Tyralis H, Dimitriadis P, Koutsoyiannis D, O'Connell PE, Tzouka K, Iliopoulou T
(2018) On the long-range dependence properties of annual precipitation using a
global network of instrumental measurements. Advances in Water Resources
111:301–318. https://doi.org/10.1016/j.advwatres.2017.11.010
[129] Tyralis H, Papacharalampous G, Langousis A (2019a) A brief review of random
forests for water scientists and practitioners and their recent history in water
resources. Water 11(5):910. https://doi.org/10.3390/w11050910
[130] Tyralis H, Papacharalampous G, Tantanee S (2019b) How to explain and predict
the shape parameter of the generalized extreme value distribution of
streamflow extremes using a big dataset. Journal of Hydrology 574:628–645.
https://doi.org/10.1016/j.jhydrol.2019.04.070
[131] Verikas A, Gelzinis A, Bacauskiene M (2011) Mining data with random forests: A
survey and results of new tests. Pattern Recognition 44(2):330–349.
https://doi.org/10.1016/j.patcog.2010.08.011
[132] Vrugt JA, Robinson BA (2007) Treatment of uncertainty using ensemble
methods: Comparison of sequential data assimilation and Bayesian model
averaging. Water Resources Research 43(1):W01411.
https://doi.org/10.1029/2005WR004838
[133] Waldmann E (2018) Quantile regression: A short story on how and why.
Statistical Modelling 18(3–4):203–218.
https://doi.org/10.1177/1471082X18759142
[134] Wang Y, Zhang N, Tan Y, Hong T, Kirschen DS, Kang C (2019) Combining
probabilistic load forecasts. IEEE Transactions on Smart Grid 10(4):3664–3674.
https://doi.org/10.1109/TSG.2018.2833869
[135] Warnes GR, Bolker B, Gorjanc G, Grothendieck G, Korosec A, Lumley T,
MacQueen D, Magnusson A, Rogers J (2017) gdata: Various R programming
tools for data manipulation. R package version 2.18.0. https://CRAN.R-
project.org/package=gdata
[136] Weerts AH, Winsemius HC, Verkade JS (2011) Estimation of predictive
hydrological uncertainty using quantile regression: Examples from the national
flood forecasting system (England and Wales). Hydrology and Earth System
Sciences 15:255–265. https://doi.org/10.5194/hess-15-255-2011
[137] Weijs SV, Schoups G, Van de Giesen N (2010) Why hydrological predictions
should be evaluated using information theory. Hydrology and Earth System
Sciences 14:2545–2558. https://doi.org/10.5194/hess-14-2545-2010
[138] Wickham H (2007) Reshaping data with the reshape package. Journal of
Statistical Software 21(12). https://doi.org/10.18637/jss.v021.i12
[139] Wickham H (2016) ggplot2. Springer-Verlag New York.
https://doi.org/10.1007/978-0-387-98141-3
[140] Wickham H (2017) reshape2: Flexibly reshape data: A reboot of the reshape
package. R package version 1.4.3. https://CRAN.R-
project.org/package=reshape2
[141] Wickham H (2019) stringr: Simple, consistent wrappers for common string
operations. R package version 1.4.0. https://CRAN.R-
project.org/package=stringr
[142] Wickham H, Hester J, Francois R (2018) readr: Read rectangular text data. R
package version 1.3.1. https://CRAN.R-project.org/package=readr

[143] Wickham H, Chang W, Henry L, Pedersen TL, Takahashi K, Wilke C, Woo K
(2019a) ggplot2: Create elegant data visualisations using the grammar of
graphics. R package version 3.1.1. https://CRAN.R-project.org/package=ggplot2
[144] Wickham H, François R, Henry L, Müller K (2019b) dplyr: A grammar of data
manipulation. R package version 0.8.0.1. https://CRAN.R-
project.org/package=dplyr
[145] Wickham H, Hester J, Chang W (2019c) devtools: Tools to make developing R
packages easier. R package version 2.0.2. https://CRAN.R-
project.org/package=devtools
[146] Winkler RL (1972) A decision-theoretic approach to interval estimation. Journal
of the American Statistical Association 67(337):187–191.
https://doi.org/10.1080/01621459.1972.10481224
[147] Wolpert DH (1992) Stacked generalization. Neural Networks 5(2):241–259.
https://doi.org/10.1016/S0893-6080(05)80023-1
[148] Xie Y (2014) knitr: A Comprehensive Tool for Reproducible Research in R. In:
Stodden V, Leisch F, Peng RD (Eds) Implementing Reproducible Computational
Research. Chapman and Hall/CRC
[149] Xie Y (2015) Dynamic Documents with R and knitr, 2nd edition. Chapman and
Hall/CRC
[150] Xie Y (2019) knitr: A general-purpose package for dynamic report generation in
R. R package version 1.22. https://CRAN.R-project.org/package=knitr
[151] Xu L, Chen N, Zhang X, Chen Z (2018) An evaluation of statistical, NMME and
hybrid models for drought prediction in China. Journal of Hydrology 566:235–
249. https://doi.org/10.1016/j.jhydrol.2018.09.020
[152] Yan J, Liao GY, Gebremichael M, Shedd R, Vallee DR (2014) Characterizing the
uncertainty in river stage forecasts conditional on point forecast values. Water
Resources Research 48(12):W12509. https://doi.org/10.1029/2012WR011818
[153] Yao Y, Vehtari A, Simpson D, Gelman A (2018) Using stacking to average
Bayesian predictive distributions. Bayesian Analysis 13(3):917–1003.
https://doi.org/10.1214/17-BA1091
[154] Ye A, Duan Q, Yuan X, Wood EF, Schaake J (2014) Hydrologic post-processing of
MOPEX streamflow simulations. Journal of Hydrology 508:147–156.
https://doi.org/10.1016/j.jhydrol.2013.10.055
[155] Yu B, Xu Z (2008) A comparative study for content-based dynamic spam
classification using four machine learning algorithms. Knowledge-Based
Systems 21(4):355–362. https://doi.org/10.1016/j.knosys.2008.01.001
[156] Zhao L, Duan Q, Schaake J, Ye A, Xia J (2011) A hydrologic post-processor for
ensemble streamflow predictions. Advances in Geosciences 29:51–59.
https://doi.org/10.5194/adgeo-29-51-2011

Conflicts of interest: The authors declare no conflict of interest.

Highlights

• Probabilistic forecasts are combined using weighted stacked generalization.
• Quantile regression and quantile regression forests are used as base-learners.
• Ensemble learning (EL) is used to postprocess hydrological model simulations.
• EL performance is assessed based on 511 time series of daily streamflows in CONUS.
• Average relative improvement over the two base-learners is approximately 6%.
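The weighted combination of base-learner quantile forecasts summarized above can be sketched in a few lines. The following Python snippet is an illustrative toy (not the paper's R implementation; the data and forecasts are synthetic): two hypothetical base-learner forecasts of a single quantile are combined with the convex weight that minimizes the average quantile (pinball) loss, found by grid search.

```python
import numpy as np

def pinball_loss(y, q_pred, q):
    """Average quantile (pinball) loss of predictions q_pred at quantile level q."""
    u = y - q_pred
    return np.mean(np.maximum(q * u, (q - 1.0) * u))

rng = np.random.default_rng(42)
y = rng.gamma(shape=2.0, scale=1.0, size=500)  # synthetic "streamflow" sample

# Two hypothetical base-learner forecasts of the q = 0.9 quantile:
# one biased high, one biased low (constant forecasts, for illustration only).
q = 0.9
base1 = np.full(y.size, 4.4)
base2 = np.full(y.size, 3.4)

# Grid search over the convex weight w applied to base1 (1 - w to base2);
# the grid includes the endpoints, so the combination can never be worse
# in-sample than the better of the two base-learners.
weights = np.linspace(0.0, 1.0, 101)
losses = [pinball_loss(y, w * base1 + (1.0 - w) * base2, q) for w in weights]
w_best = weights[int(np.argmin(losses))]
combined = w_best * base1 + (1.0 - w_best) * base2
```

In the paper's setting the weights are instead estimated on held-out simulations (stacked generalization) and applied per quantile level; the sketch only shows the mechanics of the convex quantile combination.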
