Abstract—This paper presents a way to use the recurrent quadratic Volterra system to forecast wind power output. The recurrent quadratic Volterra system is a second-order polynomial equation that uses the output data as feedback recursively. The Volterra system is extracted from the weights of a Recurrent Neural Network (RNN). During this process, three innovative techniques are used. In order to build the Volterra kernels from combinations of weights, the activation function is approximated by a high-order polynomial function using the Lagrangian interpolation. Furthermore, the memory of the Volterra system is identified using the Partial Autocorrelation Function. After building the Volterra system, the wind power output is forecasted 15 and 30 minutes ahead with confidence intervals at the 95% confidence level. The confidence interval is calculated using multi-linear regression techniques. The stability of the recurrent Volterra system is also examined by a heuristic method.

Index Terms—Volterra system, Recurrent Neural Network, Wind Power System, Short-Term Forecasting

I. INTRODUCTION

… the power curve of the manufacturer and the wind turbine control change, so the forecasting models must also change. Furthermore, since the RNN uses a non-polynomial activation function, it is hard to measure the nonlinearity of the wind power output. Third, it is hard to check the stability of the RNN. Since the RNN uses the output as feedback, its asymptotic stability should be checked, because an unstable RNN cannot produce multi-step-ahead forecasts. Finally, the RNN can easily be over-fitted to the data.

These limits could be mitigated by using the Volterra system, which has previously been used to describe nonlinear systems [8]. Volterra kernels can express the nonlinearity, and the model changes as the wind speed varies. We can also find the dominant Volterra kernels using the genetic algorithm in [9], so we can further reduce the number of model parameters. Furthermore, it is easier to check the stability of the Volterra system than that of the RNN, since the Volterra kernels can be represented as a linear state-space model. The second method of Lyapunov can be used to check the stability of the Volterra system [10].
… In the next section, wind forecasting models are introduced and classified with respect to various standards. In addition, an overview of the recurrent quadratic Volterra system is given, including the number of Volterra kernels.

A. Wind Power Forecasting Classifications

Wind forecasting models can be classified into regression and recurrent models according to the structure of the forecasting model. The recurrent model receives its own output as feedback; the RNN and time-series models belong to this approach. Forecasting models in this category are usually used for short-term forecasting.
The regression model uses a function between the exogenous input and the output data. This model receives geographical and meteorological information, such as wind speed and temperature, as input. Even though the regression model can describe the relationship between the exogenous input and the wind power output, it cannot fully describe the dynamic characteristic of a wind power system. Furthermore, for the regression model, in order to forecast the wind power output, the input data should be forecasted before anything else. The input data is usually predicted by a Numerical Weather Prediction (NWP) model [13]. The time-series model, NN, and Kalman filter fall into this category, but the most powerful regression model is the power curve of the wind turbine manufacturers [5].
In addition, wind forecasting models can be classified into the stochastic model and the deterministic model with respect to the assumptions made about the data [14]. In the stochastic model, the data is assumed to follow a stochastic process based on a probability distribution. The predicted values usually stay around the mean value. Since wind power output is not stationary in the mean, these stochastic models are not good at long-term prediction of wind power output; therefore, they too are used for short-term prediction. Time-series models such as the autoregressive (AR) and the moving average (MA) models belong to this category.
In the non-stochastic model, the data is decided by deterministic protocols. The Least Mean Square (LMS), Recursive Least Square (RLS), Volterra system, and NN are in this category. Many literature reviews of various forecasting models can be found in [15].

B. Recurrent Quadratic Volterra System

The Volterra system of order p and memory M is

x(n) = h_0 + \sum_{k_1=1}^{M} h_1(k_1)\, x(n-k_1) + \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} h_2(k_1,k_2)\, x(n-k_1)\, x(n-k_2) + \cdots + \sum_{k_1=1}^{M} \cdots \sum_{k_p=1}^{M} h_p(k_1,\ldots,k_p)\, x(n-k_1) \cdots x(n-k_p) + e(t), \quad n = M+1, M+2, \ldots \quad (1)

where h_0 is a constant, which is zero in this case, and h_p(k_1,\ldots,k_p) is the set of pth-order Volterra kernel coefficients. In order to reduce the computational complexity, the kernels are assumed to satisfy

h_p(k_1,\ldots,k_P) = 0 \quad \text{if} \quad k_1 > \cdots > k_P \quad (2)

Since the kernels h_p can be assumed to be symmetric with respect to all permutations of the indices k_1,\ldots,k_P, one kernel per permutation class is enough to describe the Volterra system; the other kernels become zero. The input signal is X_{n-1} = \{x(n-1), x(n-2), \ldots, x(n-M)\}, and the output signal is x(n). All signals and kernels are real numbers. Since it is difficult to separate the errors of the truncated Volterra model into natural system errors and higher-order-term errors, we assume that all errors originate from mismatches with the natural system; in other words, the higher-order terms are absorbed into the Volterra system errors.

Although the Volterra system has been considered a powerful model for analyzing nonlinear systems, the large number of kernels and their identification have been problematic. The number of Volterra kernels of each order can be calculated as a combination with repetition:

\frac{(M+P)!}{P!\, M!} \quad (3)

The number of Volterra kernels grows rapidly as the order and the memory increase. For example, the number of third-order Volterra kernels with a memory of 20 is 1,771. This is a huge and impractical number, so in this work the order is limited to two and the memory is limited to 20.
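As a sanity check on (3), the kernel count can be computed directly as a combination with repetition. The following Python sketch is ours, not the authors':

    from math import comb

    def num_volterra_kernels(memory: int, order: int) -> int:
        """Number of Volterra kernels of a given order, eq. (3):
        (M + P)! / (P! M!), i.e. a combination with repetition."""
        return comb(memory + order, order)

    # Example from the text: third-order kernels with a memory of 20.
    print(num_volterra_kernels(20, 3))  # -> 1771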
Fig. 1. Wind power output sampled from the Brazos wind farm in Texas. Data was sampled on 5/21/07 at 15:00 for 1,000 minutes. (Axes: minute vs. MW; the first 800 minutes serve as training (80%) and validation (20%) data, the last 200 minutes as test data.)

Fig. 3. Sample partial autocorrelation function of the wind power output, with confidence intervals at the 95% confidence level. The estimated memory is five, but it is discovered that a memory of six performs better.
… data. Random division can prevent the NN from being over-fitted to the early data. Eighty percent of the data from the first 800 minutes is used to train the NN, and 20% is used to validate it. The data from the last 200 minutes is used to test the model and forecast the future output power. The identification process is then as follows.

Step 1. Preprocess the wind power output data.
Step 2. Establish the NN and identify the memory.
Step 3. Adjust and approximate the activation function.
Step 4. Train the NN and extract the Volterra system.
Step 5. Forecast the wind power output.
Step 6. Find the confidence intervals and check the stability.
Data is preprocessed in order to increase the convergence speed and to extract the Volterra kernels accurately. Preprocessing consists of mean subtraction and downscaling, and the two steps are applied to the input and target data individually. Mean subtraction generates unbiased data and helps to extract unbiased Volterra kernels; all processes in this paper handle mean-subtracted data. Then, the input and target data are downscaled individually so that they have the same magnitude. To give the input and target data a similar magnitude, we do not normalize but downscale the data. Downscaling divides the data by its maximum absolute value and does not subtract anything further. In contrast, normalization maps the data into [−1, 1] and subtracts the minimum. Normalization would require subtracting the minimum from the input data, so the NN would be trained on minimum-subtracted data and, as a result, biased Volterra kernels would be extracted. Therefore, the wind power output is downscaled.
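A minimal Python sketch of this preprocessing, under our own naming conventions (the paper does not publish code):

    import numpy as np

    def preprocess(series: np.ndarray):
        """Mean subtraction followed by downscaling, as described above.
        Returns the preprocessed data plus the statistics needed to undo it."""
        mean = series.mean()
        centered = series - mean            # unbiased, zero-mean data
        factor = np.abs(centered).max()     # downscaling factor F_x
        return centered / factor, mean, factor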
IV. MODEL IDENTIFICATION

In this section, we identify the structure of the RNN and the memory. Identifying the RNN includes building the activation function and approximating it by a polynomial function using the Lagrangian Interpolation (LI).

The RNN is a dynamic neural network that uses an output as an input after passing it through a time delay. The RNN in this paper has three layers: one input layer, one hidden layer, and one output layer. The input layer receives as many input neurons as the number of memory terms, and the input neurons are in the tap-delayed form shown in [11]. The hidden layer has an arbitrary number of neurons, decided heuristically from the complexity of the data or the number of Volterra kernels. Since the RNN cannot always find the optimum solution, we run the program many times on each data set with different initial values. Hidden neurons receive a "net input," which is the sum of the input values multiplied by their corresponding weights [16]. The hidden neurons use the hyperbolic tangent function tanh as the activation function. The output layer has only one neuron and uses a linear activation function. Fig. 2 shows the overview of the RNN structure. In Fig. 2, Z^{-1} denotes a one-step time delay, and the number of hidden neurons is abbreviated.
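To make the structure concrete, a single forward step of such an RNN can be sketched as follows. This is our illustration, not the authors' implementation; V, b1, and W are assumed names for the input weights, hidden biases, and output weights, and a, b are the activation constants introduced in Section IV-B:

    import numpy as np

    def rnn_step(history, V, b1, W, a=1.0, b=0.5):
        """One forward step of the RNN: tap-delayed inputs x(n-1)..x(n-M),
        one hidden layer with activation a*tanh(b*x), one linear output.
        history: (M,) last M outputs, newest first; V: (H, M); b1: (H,); W: (H,)."""
        s = V @ history + b1        # net inputs of the hidden neurons
        i = a * np.tanh(b * s)      # hidden-layer outputs
        return W @ i                # linear output neuron: forecast of x(n)

For multi-step forecasting, the returned value is pushed onto the front of history and the step is repeated; this feedback loop is exactly why the stability of the model matters.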
A. Memory Identification

The memory is identified by non-parametric model identification, as used for AR model estimation, based on two facts. First, the number of nonlinear terms in the Volterra model depends on the memory of the linear terms, since the nonlinear terms are all possible combinations of the linear terms. That is, in order to have nonlinear terms of higher memory, the model must have linear terms of higher memory as well. Second, the nonlinear terms are extra terms on top of the linear model, and most models can be approximated by linear models, so their involvement affects the overall model structure less. Furthermore, since it is very hard to determine the sequence of the nonlinear terms, unlike the orderly arranged linear terms, the memory is decided from the number of linear terms. Therefore, we focus on analyzing the AR model of the given data in order to narrow the pool of possible model structures down to the few that are most likely to be a good fit.
Fig. 2. The recurrent neural network can be converted to the Volterra system through the Volterra kernels extraction process. (The diagram shows the tap-delayed inputs x(n−1), …, x(n−M) feeding the recurrent neural network and, after the extraction process, the Volterra system with kernels h_1(0), …, h_1(M) and h_2(0,1), …, h_2(M,M).)
If data sampled in the past generate the data sampled at time t, the past data and the present data should be correlated. This correlation is generally detected through the autocorrelation function (ACF) and the PACF [17]. In this paper, the PACF is used to decide the memory of the Volterra system as an initial guess. The PACF of an AR model is close to zero after the memory term [18]. The PACF is defined in [18] as

\phi_{kk} = \frac{\begin{vmatrix} 1 & \rho_1 & \rho_2 & \cdots & \rho_{k-2} & \rho_1 \\ \rho_1 & 1 & \rho_1 & \cdots & \rho_{k-3} & \rho_2 \\ \vdots & \vdots & \vdots & & \vdots & \vdots \\ \rho_{k-1} & \rho_{k-2} & \rho_{k-3} & \cdots & \rho_1 & \rho_k \end{vmatrix}}{\begin{vmatrix} 1 & \rho_1 & \rho_2 & \cdots & \rho_{k-2} & \rho_{k-1} \\ \rho_1 & 1 & \rho_1 & \cdots & \rho_{k-3} & \rho_{k-2} \\ \vdots & \vdots & \vdots & & \vdots & \vdots \\ \rho_{k-1} & \rho_{k-2} & \rho_{k-3} & \cdots & \rho_1 & 1 \end{vmatrix}} \quad (4)

where \rho_k is the ACF. The \rho_k between X_t and X_{t-k} is defined in [18] as

\rho_k = \frac{\mathrm{Cov}(X_t, X_{t-k})}{\sqrt{\mathrm{Var}(X_t)\,\mathrm{Var}(X_{t-k})}} \quad (5)

where \mathrm{Cov}(X_t, X_{t-k}) is the cross-covariance between X_t and X_{t-k}, defined as

\gamma_k = \mathrm{Cov}(X_t, X_{t-k}) = E[(X_t - \mu)(X_{t-k} - \mu)] \quad (6)

where \mu is the mean of X_t.
The PACF of the wind power output is shown in Fig. 3. According to Fig. 3, the initial guess of the memory is selected as five, since the correlation coefficients dampen as a sinusoidal wave within the confidence bounds after five terms. Around this initial guess, the pool of candidate Volterra models is built. The candidate models are V(2,3), V(2,4), V(2,5), V(2,6), and V(2,7). After comparing them in a later section, the most suitable model is selected.
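The initial memory guess can be reproduced numerically. The sketch below is ours: it computes the sample ACF as in (5)–(6) and obtains φ_kk from the Yule-Walker equations, which is mathematically equivalent to the determinant ratio in (4):

    import numpy as np

    def sample_acf(x, k):
        """Sample ACF rho_k of a zero-mean series, following (5)-(6)."""
        n = len(x)
        return np.dot(x[:n - k], x[k:]) / np.dot(x, x)

    def sample_pacf(x, k):
        """Sample PACF phi_kk: the last AR(k) coefficient from the
        Yule-Walker equations, equivalent to the determinant ratio (4)."""
        rho = np.array([sample_acf(x, i) for i in range(k + 1)])
        R = np.array([[rho[abs(i - j)] for j in range(k)] for i in range(k)])
        return np.linalg.solve(R, rho[1:])[-1]

    def initial_memory(x, max_lag=40):
        """First lag whose PACF falls inside the 95% confidence bound."""
        bound = 1.96 / np.sqrt(len(x))
        for k in range(1, max_lag + 1):
            if abs(sample_pacf(x, k)) < bound:
                return k - 1
        return max_lag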
B. Activation Function

The activation function in the hidden layer is the hyperbolic tangent function tanh. As shown in Fig. 4, the tanh consists of an activation range and a saturation range. In the activation range, the tanh is close to a linear function; in the saturation range, it is severely nonlinear.

Nonlinear signals generate large net inputs. If the net inputs are larger than the activation range, the activation function drives them to the saturated value, and the neurons in the hidden layer saturate. A saturated RNN cannot learn the input signals well. Besides, it is hard to approximate the activation function in the saturation range, so it is also hard to extract Volterra kernels from a saturated RNN. Therefore, the net inputs should be located in the activation range. In order to avoid the saturation ranges of the activation function, the linear range should be wide enough to receive large net inputs.

The linear range can be expanded by transforming the activation function with constraints a and b as

\varphi(x) = a \tanh(bx) \quad (7)

The constraints a and b rely on the linear range, which depends on the maximum net inputs. Since the range of net inputs is not known before training the NN, transforming the activation function and training the NN should be performed recursively. In this paper, \varphi(x) has \varphi(-2) = -1 and \varphi(2) = 1. The constraint a is determined as 1. Then, the constraint b is found by

b = \frac{1}{2\max(\mathbf{X})} \log \frac{1 + \max(\mathbf{Y})/a}{1 - \max(\mathbf{Y})/a} \quad (8)
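Equation (8) is the inverse of tanh written out: b is chosen so that a·tanh(b·max(X)) = max(Y). A one-function sketch (our naming; it requires max|Y|/a < 1 so the logarithm is defined):

    import numpy as np

    def find_b(X, Y, a=1.0):
        """Constraint b from (8), i.e. atanh(max|Y|/a) / max|X|."""
        r = np.max(np.abs(Y)) / a
        return np.log((1.0 + r) / (1.0 - r)) / (2.0 * np.max(np.abs(X)))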
Fig. 4. The tangent sigmoid function, approximated to a polynomial equation by the Lagrangian Interpolation. The approximated polynomial equation is shown as a dotted line. (Axes: x vs. y; the legend marks a*tanh(b*x), the 17th-order polynomial, and the activation region.)

Fig. 4 shows the activation function a tanh(bx) and its approximated polynomial function. In this case, the activation function between −6.15 and 6.15 is approximated. The activation function and its approximated polynomial function vary with respect to the complexity of the data.

We approximate the tanh to a polynomial equation by using the LI. The LI finds the polynomial equation of least degree through the sampled data. Since the least degree is one less than the number of sampled points, we confine the number of samples in order to confine the degree of the polynomial to a practical value. The data were equally sampled within the linear range of the activation function. In this paper, the number of samples is 18, so the degree of the polynomial becomes 17. The approximated polynomial function L(x) is defined as

L(x) = \sum_{i=1}^{O} f(x_i)\, l_i(x) \quad (9)

where f(x) is the target function, which is the tanh in this paper, and O is the number of samples. The l_i(x) are the Lagrange basis polynomials, defined as

l_i(x) = \prod_{\substack{1 \le j \le O \\ j \ne i}} \frac{x - x_j}{x_i - x_j} \quad (10)

It should be noted that j must be different from i and that the same point should not be sampled twice. The 17th-degree polynomial equation is shown in Fig. 4 as a dotted line.
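Because the interpolating polynomial through O points is unique, (9)–(10) can be reproduced with an ordinary degree-(O−1) polynomial fit. The sketch below is ours, and the value of b is purely illustrative:

    import numpy as np

    def lagrange_poly_coeffs(f, lo, hi, n_samples=18):
        """Degree-(n_samples-1) interpolating polynomial of f on equally
        spaced points in [lo, hi], per (9)-(10). Returns coefficients
        [a_17, ..., a_0], highest degree first, as in (11).
        (numpy may warn: such high-degree fits are ill-conditioned.)"""
        xs = np.linspace(lo, hi, n_samples)
        return np.polyfit(xs, f(xs), n_samples - 1)

    # Illustrative only: approximate a*tanh(b*x) on the paper's range.
    a, b = 1.0, 0.275  # b would come from (8) in practice
    coeffs = lagrange_poly_coeffs(lambda x: a * np.tanh(b * x), -6.15, 6.15)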
V. EXTRACTION

In this section, we extract the Volterra kernels from the Lagrangian polynomial and post-process them. The structure of the extracted Volterra system is shown in Fig. 2. It is assumed that all activation functions have the same Lagrangian polynomial and that the number of hidden neurons is H. Furthermore, the memory is assumed to be M. The structure of the RNN is compared to the structure of the extracted Volterra system in Fig. 2. The coefficients of the 17th-degree Lagrangian polynomial, \mathbf{a}, are defined as

\mathbf{a} = [a_{17}, a_{16}, \ldots, a_0] \quad (11)

Then, the net input of the hth hidden neuron S_h is defined as

S_h = w_{h,1}\, x(t-1) + w_{h,2}\, x(t-2) + \cdots + w_{h,M-1}\, x(t-M+1) + w_{h,M}\, x(t-M) + b_1 \quad (12)

where w_{h,m} is the weight of the mth delayed input in the hth neuron. Given \mathbf{a} and S_h, the output of the hth polynomial I_h is defined as

I_h = \mathbf{a}^T \times \mathbf{S} \quad (13)

where \mathbf{S} = [S_h^{17}, S_h^{16}, \ldots, S_h^0]. The I_h form the output vector of the hidden neurons \mathbf{I}, which is defined as \mathbf{I} = [I_H, I_{H-1}, \ldots, I_1]. Then, after passing the output layer, the output of the RNN x(n) at time stamp n is calculated as

x(n) = \mathbf{W}^T \times \mathbf{I}(n) \quad (14)

where \mathbf{W} is the weight vector of the output layer, defined as \mathbf{W} = [W_H, W_{H-1}, \ldots, W_1].

Since we want to build the quadratic Volterra system, we should extract the linear and quadratic terms from (13). For the dth degree with d \ge 2, S_h^d can be described as

S_h^d = \left[ w_{h,1}\, x(t-1) + w_{h,2}\, x(t-2) + \cdots + w_{h,M}\, x(t-M) + b_1 \right]^d
\approx d\, b_1^{d-1} \left[ w_{h,1}\, x(t-1) + w_{h,2}\, x(t-2) + \cdots + w_{h,M}\, x(t-M) \right]
+ \binom{d}{2} b_1^{d-2} \left[ w_{h,1}\, x(t-1) \right]^2 + \binom{d}{2} b_1^{d-2} \left[ w_{h,2}\, x(t-2) \right]^2 + \cdots + \binom{d}{2} b_1^{d-2} \left[ w_{h,M}\, x(t-M) \right]^2
+ 2\binom{d}{2} b_1^{d-2}\, w_{h,1}\, x(t-1)\, w_{h,2}\, x(t-2) + 2\binom{d}{2} b_1^{d-2}\, w_{h,1}\, x(t-1)\, w_{h,3}\, x(t-3) + \cdots + 2\binom{d}{2} b_1^{d-2}\, w_{h,M-1}\, x(t-M+1)\, w_{h,M}\, x(t-M)
+ \text{H.O.T.} \quad (15)

If we substitute \mathbf{S} in (13) with the vector consisting of the extracted linear and second-degree terms, the coefficients of the inputs propagate from (13) to (14) and become the Volterra kernels.
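The extraction in (11)–(15) can be sketched as follows, under our own naming: the polynomial activation is expanded around the hidden bias, the terms linear and quadratic in the delayed inputs are kept, and each hidden neuron's contribution is weighted by its output weight as in (14):

    import numpy as np
    from math import comb

    def extract_kernels(V, b1, W, a_poly):
        """V: (H, M) input weights; b1: (H,) hidden biases; W: (H,) output
        weights; a_poly: polynomial coefficients [a_0, ..., a_D] in
        ascending order (reverse the output of numpy.polyfit).
        Returns the linear kernels h1 (M,) and quadratic kernels h2 (M, M)."""
        H, M = V.shape
        D = len(a_poly) - 1
        h1, h2 = np.zeros(M), np.zeros((M, M))
        for h in range(H):
            c = b1[h]
            # expansion of sum_d a_d (u + c)^d around u = 0:
            alpha1 = sum(a_poly[d] * d * c ** (d - 1) for d in range(1, D + 1))
            alpha2 = sum(a_poly[d] * comb(d, 2) * c ** (d - 2) for d in range(2, D + 1))
            h1 += W[h] * alpha1 * V[h]                  # linear terms
            h2 += W[h] * alpha2 * np.outer(V[h], V[h])  # quadratic terms
        return h1, h2

For example, with the coefficients from the interpolation sketch above: h1, h2 = extract_kernels(V, b1, W, coeffs[::-1]).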
The data and Volterra kernels should be post-processed, because the Volterra kernels are kernels of the downscaled data. Post-processing converts the downscaled data back to the mean-subtracted data, and the kernels of the downscaled data to kernels of the mean-subtracted data. The Volterra system of the downscaled data is shown below:
Fig. 5. The Volterra kernels are extracted from the RNN: (a) the linear Volterra kernels with a memory of 6; (b) the quadratic Volterra kernels with a memory of 6.
\tilde{x}(n) = \tilde{h}_0 + \sum_{k_1=1}^{M} \tilde{h}_1(k_1)\, \tilde{x}(n-k_1) + \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \tilde{h}_2(k_1,k_2)\, \tilde{x}(n-k_1)\, \tilde{x}(n-k_2) + e(t), \quad n = M+1, M+2, \ldots \quad (16)

where \tilde{x}(n) denotes the downscaled data, defined as

\tilde{x}(t) = \frac{x(t)}{F_x} \quad (17)

where F_x is the downscaling factor and x(t) is the mean-subtracted data. Substituting (17) into (16), (16) becomes
\frac{x(n)}{F_x} = \tilde{h}_0 + \sum_{k_1=1}^{M} \frac{\tilde{h}_1(k_1)}{F_x}\, x(n-k_1) + \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \frac{\tilde{h}_2(k_1,k_2)}{F_x^2}\, x(n-k_1)\, x(n-k_2) \quad (18)

The quadratic kernels in Fig. 5(b) are symmetric to each other. The variance of the kernels with respect to the delays is not shown; the quadratic kernels might be distributed equally over the first and second delay fields.

Fig. 6. The performance graph of the RNN. The training algorithm is the Conjugate Gradient Algorithm. The goal is 0.005, and the number of training iterations is 67. (Axes: number of training iterations vs. mean square error (MSE), for training and validation data.)
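Following (16)–(18), converting kernels of the downscaled data back to kernels of the mean-subtracted data needs only the downscaling factor F_x. A minimal sketch (our naming):

    def rescale_kernels(h0, h1, h2, Fx):
        """Kernels of downscaled data -> kernels of mean-subtracted data.
        Multiplying (18) through by Fx gives:
        x(n) = Fx*h0 + sum h1(k1) x(n-k1)
             + sum sum (h2(k1,k2)/Fx) x(n-k1) x(n-k2)."""
        return h0 * Fx, h1, h2 / Fx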
Fig. 7. The wind power output is forecasted with confidence intervals at the 95% confidence level up to (a) 15 minutes ahead and (b) 30 minutes ahead. (Axes: time vs. wind farm output (MW); each panel shows the test data and the confidence interval.)