S. E. Kuznetsov
Department of Mathematics, University of Colorado,
Boulder
E-mail address: Sergei.Kuznetsov@Colorado.edu
[Figure: Sales (monthly data), Sales vs. Months]
[Figure: monthly data, Data vs. Months]
[Figure: Sunspots (annual data), Years 1750–1900]
[Figure: Sunspots (monthly data), Years 1950–2010]
Figure 5. Sunspots (monthly data). One can now see that each eleven-year
cycle begins with a steep rise followed by a slower decay. No commonly
used model describes this type of behavior; this data requires a
custom-made non-linear model.
This series is much harder to predict. Unless we develop a model that describes the
process from the physical point of view, we can only build an autoregressive model
based on correlations between consecutive values of the series. However, if we look
at the graph, we see that the shape of the main cycle is not symmetric; the series
goes up steeply and then decays at a slower rate. This effect becomes even more
transparent if we switch to monthly data (Figure 5). Such behavior can be
described only by a custom-made non-linear model; commonly used models cannot
explain it.
4. Monthly averages of an exchange rate (Pound vs. US $, sometime back in
the eighties) (10 years, 120 data points, Figure 6). All series of this type are hard to
predict. In the financial world, it is considered a success if you can predict just the
sign of the increment of a series (Will it go up or down? Should we buy or should
we sell?) with probability 51.5%.
5. Electromagnetic activity of a human brain (EEG) (1 minute, 7700 data
points, Figure 7). This particular data set was a part of a psychological experiment.
A person is waiting for 20 seconds, then a picture appears and, during the next 20
[Figure 6: Exchange Rate vs. Months]
[Figure 7: EEG, Seconds 18–21]
[Figure 8: Periodogram of the EEG data, Frequency (Hz) 0–60]
Figure 8. Spectrum of the EEG data. The high peak on the left
is a resonance; the peak on the right is the power frequency (50 Hz
in Europe).
seconds, the person tries to memorize it; then the picture disappears and he/she
tries to reconstruct it. The idea of the experiment was to relate the properties of the
data to the type of brain activity. The project itself was unsuccessful (an attempt to
find a significant difference in the behavior of the series failed). However, spectral
analysis of the data (Figure 8) revealed the presence of two periodic components:
one corresponds to the frequency of the power generator, the other is
a resonance.
[Figure 9: Power Output vs. Days, 0–10]
[Figure 10: Power Output vs. Days, 6.0–7.5]
6. Power plant data (demand). (Time step is 1 minute, 48,000 points, Figure 9).
We are interested in prediction: you should have some reserve of power, otherwise
the electric grid may get overloaded. However, you should not overdo that: a reserve
is expensive, for one thing, and also causes some trouble to the electric grids. In
Figure 9, you can clearly see daily and weekly cycles. However, as you can see in
Figure 10, some data are missing. Normally, a few missing values here and there
could be handled; the so-called Kalman filter is used for that. But here, a five-hour
chunk of data is missing (300 data points in a row).
7. Radioactive decay model, 450 points. This is simulated data (Figure 11)
that has been used in order to test some estimation methods. The model behind the
data is $X_t = C_1 e^{-\lambda_1 t} + C_2 e^{-\lambda_2 t} + C_3 e^{-\lambda_3 t} + \text{noise}$, which corresponds
to the radioactive decay of a mixture of three radioactive substances with different
half-lives characterized by the parameters $\lambda_1$, $\lambda_2$ and $\lambda_3$. The trouble here is that
none of the parameters of the model can be estimated consistently. For this data,
we have assumed that there is a large amount of a substance with a short half-life,
a moderate amount of a substance with a medium half-life, and an even smaller amount
of a substance with a long half-life. So we have used the initial portion of the data
in order to estimate the first component, then tried to use the middle section of the
data in order to estimate the second component (and to refine the estimate for the
first component as well), and, finally, tried to get the third component out of the
[Figure 11: simulated decay data, Time 0–400]
[Figure 12: profile of a surface]
Figure 13. Ocean level (annual data). Note how rounding can
kill all the information but the long-term trend.
rest of the data. We only managed to get the first two components, as the
last portion of the data was obscured by the noise.
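As an illustration, the decay model is easy to simulate. The parameter values below are hypothetical, chosen so that the fast component dominates at the start while the slow component is eventually buried in noise of comparable size, as in the setting described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters: a large fast-decaying component, a moderate
# medium one, and a small slow one (half-life = ln(2) / lambda).
C = np.array([1000.0, 300.0, 50.0])
lam = np.array([0.100, 0.010, 0.001])

t = np.arange(450)                      # 450 points, as in the text
signal = (C[:, None] * np.exp(-lam[:, None] * t)).sum(axis=0)
X = signal + rng.normal(0.0, 20.0, size=t.size)   # additive noise

# Early on, the fast component dominates; by the end of the series the
# slow component is comparable to the noise standard deviation (20).
print(signal[0], signal[-1])
```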
8. A profile of a surface (Figure 12). Though not a time series, such data can
also be analyzed by the methods of Time Series analysis. The objective of the study
was to test a new technology (they wanted to compare the characteristics of different
profiles).
9. Water level data (the ocean, the Dead Sea, some lakes; Figures 13–17). Annual
data, part of a study of climate change. More than a hundred major
lakes around the globe were used. One of the problems here was that the available
data had little overlap in time.
Stationary and Non-stationary series. A series is called stationary if,
broadly speaking, its characteristics and behavior do not change in time. A stationary
series may still contain cycles, that is, oscillations that look periodic but
can't be attributed to the calendar.
[Figures 14–17: water level data (annual), Years 1860–1960]
(day–night effects and business-day vs. weekend effects). The series is too short to
see the annual cycle that must be there as well (and we should probably expect the
shape of the daily cycle to change from season to season, for instance because of
the change in sunrise and sunset times).
Methods: Theoretical and Empirical. There is a wide variety of models,
methods and algorithms used in time series analysis. Each model describes a
particular type of behavior of the data, and a model is not going to work if the
situation is different. It is up to the researcher to choose the model(s) that might work.
In a sense, the models and methods of Time Series Analysis can be split into three
categories. The first group contains methods and algorithms that are based on models
with assumptions that can be justified (statistically tested). Such models are therefore
completely legitimate, statistically and theoretically. The second group contains
algorithms which, although based on some statistical assumptions, are commonly
used in situations that differ, somewhat or significantly, from the original assumptions.
Finally, some algorithms are purely empirical: there is no model behind them;
they just look reasonable and, in certain situations, produce reasonable results. We
will discuss here basic, commonly used models and algorithms.
Visual Analysis First! Your first step should be to examine the graph of the
series. It may help you decide what kind of data you've got, and maybe spot
something which should be addressed beforehand. The following data (Figure 18)
represents the daily production of cement (in a region?). Due to the nature of the
process, the production should not be affected by weekends. However, we can see
a strange pattern in the middle of the series: two points way below (those are
Saturdays and Sundays) and one point way up (those are Mondays). We can easily
guess that somebody responsible was away for four weeks, and weekend production
was reported on Mondays. The outliers are clearly visible on the graph, but are
hard to spot otherwise. Moreover, if we do not pay attention to that and do a
correlation and/or spectral analysis (Chapters 3 and 5), we may wrongly interpret
the results as evidence of a seven-day cycle (Figures 19 and 20), which definitely
may appear in such data as daily production (but not in this one).
Preliminary transformations. In the earlier days, a preliminary transformation
was widely used in order to get a simpler model or maybe better properties
of the model (e.g., a smaller prediction error).
[Figure 18: Production Data vs. Days]
[Figure 19: ACF vs. Lags]
[Figure 20: Periodogram, Periods 7, 3.5, 2.33, 2]
[Figure 21: Sales (log scale) vs. Months]
[Figure 22: Crops Data, Years 1500–1850]
of random or seasonal component(s) looks proportional to the data. This does not
happen often; it is more typical for a series with seasonal effects (or maybe it is
just more visible in such a series). You can usually see such things
on the graph of the series. Namely, if the graph in logarithmic scale looks more
homogeneous than the graph in the ordinary scale, then a log transformation may
be considered. However, this is not a common situation. Only two examples from
the above fit this description, namely examples 1 and 2, the dry cleaning service and
air passengers; see Figures 2 and 3. You can see there that the magnitude of the seasonal
oscillation grows with time and is approximately proportional to the current value
of the trend (compare the graph of the series in the plain scale (Figure 2) and in
log scale (Figure 21)).
Another example of this type is the crops price index (annual data, 1500–1869;
yes, 370 years!). Once again, we can see that the magnitude of the random oscillation
grows with the trend. In log scale, the magnitude of the random oscillation
seems to be more or less constant. See Figures 22 and 23.
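This diagnostic is easy to check numerically. The sketch below uses a hypothetical multiplicative seasonal series: in the plain scale the within-year peak-to-trough swing grows with the trend, while in log scale it stays constant.

```python
import numpy as np

t = np.arange(144)                                # 12 years of monthly data
trend = 100.0 * 1.01 ** t                         # growing trend (hypothetical)
season = 1.0 + 0.3 * np.sin(2 * np.pi * t / 12)   # multiplicative seasonality
X = trend * season

# Peak-to-trough swing within the first and the last year:
swing_first = X[:12].max() - X[:12].min()
swing_last = X[-12:].max() - X[-12:].min()

logX = np.log(X)
log_swing_first = logX[:12].max() - logX[:12].min()
log_swing_last = logX[-12:].max() - logX[-12:].min()

print(swing_last / swing_first)          # grows with the trend
print(log_swing_last / log_swing_first)  # constant in log scale
```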
[Figure 23: Crops Data (log scale), Years 1500–1850]
Trend-based Models
Trend models are the simplest models that describe non-stationary series. Ac-
cording to the trend model, the observations Xt have the structure
Xt = f (t) + εt
where εt is the noise (say, independent and identically distributed (in short, i.i.d.)
random variables with Eεt = 0) and f (t), the trend, is a ‘nice’ non-random function.
Trend-based methods can be divided into two groups, depending on the length
of the data. If the series is short (a few tens of points), we can only approximate
f (t) by a parametric curve, for example, fit a linear trend. If the series is long, we
typically can't parameterize the trend, and we have to use non-parametric methods.
1. Parametric Models
A typical short series represents some annual data (economic, demographic,
etc.). Those data could be annual averages. It is also possible that the information
is available just once a year, like harvest data. For a short series, we usually fit an
appropriate parametric model; that is, we assume that f (t) is known to us up to a
few unknown parameters, for instance, that it is linear.
Rule of thumb: As everywhere in statistics, we should have at least ten data
points per parameter.
The simplest possible model is a linear trend model. We shall discuss it in detail.
We would like to fit a linear trend model to the data $X_1, \dots, X_N$. We assume that
$$(1.1)\quad X_t = a + bt + \varepsilon_t$$
where $\varepsilon_t$ are i.i.d. with zero expectation (so-called white noise). We have three
unknown parameters to estimate: the coefficients $a$, $b$ and the variance of the noise
$\sigma_\varepsilon^2$. The typical way to do this is to use least squares.
It is more convenient to rewrite the equation (1.1) as
$$(1.2)\quad X_t = a' + b(t - \bar t) + \varepsilon_t$$
where $\bar t = (N + 1)/2$ and $a' = a + b\bar t$ (roughly speaking, we just use the midpoint
of the time interval as the origin). Note that $\sum_{t=1}^N (t - \bar t) = 0$. According to the
method of least squares, we would like to minimize the sum
$$Q(a', b) = \sum_{t=1}^N (X_t - a' - b(t - \bar t))^2$$
So, we differentiate with respect to $a'$ and $b$ and set the partial derivatives to zero.
Differentiating with respect to $a'$, we get an equation
$$(1.3)\quad 0 = \frac{\partial Q}{\partial a'} = -2\sum_{t=1}^N (X_t - a' - b(t - \bar t)) = -2\sum_{t=1}^N X_t + 2Na' + 2b\sum_{t=1}^N (t - \bar t) = -2\sum_{t=1}^N X_t + 2Na',$$
since $\sum_{t=1}^N (t - \bar t) = 0$. So we have
$$(1.4)\quad \hat a' = \bar X = \frac{1}{N}\sum_{t=1}^N X_t$$
Similarly, setting the derivative with respect to $b$ to zero and substituting $\hat a' = \bar X$
gives an equation that can be rewritten as
$$\sum_{t=1}^N (t - \bar t)(X_t - \bar X) = b\sum_{t=1}^N (t - \bar t)^2$$
which implies
$$(1.6)\quad \hat b = \frac{\sum_{t=1}^N (X_t - \bar X)(t - \bar t)}{\sum_{t=1}^N (t - \bar t)^2}$$
or
$$(1.7)\quad \hat b = \frac{\sum_{t=1}^N (t - \bar t)X_t - \bar X\sum_{t=1}^N (t - \bar t)}{\sum_{t=1}^N (t - \bar t)^2} = \frac{\sum_{t=1}^N (t - \bar t)X_t}{\sum_{t=1}^N (t - \bar t)^2}$$
because $\sum_{t=1}^N (t - \bar t) = 0$.
Now, (1.4) and (1.2) imply
$$\hat a' = \frac{1}{N}\sum_{t=1}^N X_t = \frac{1}{N}\sum_{t=1}^N (a' + b(t - \bar t) + \varepsilon_t) = a' + \bar\varepsilon$$
where $\bar\varepsilon = \frac{1}{N}\sum_{t=1}^N \varepsilon_t$ is the average of $\varepsilon_t$. In a similar way, from (1.7) and (1.2) we get
$$(1.8)\quad \hat b = \frac{\sum_{t=1}^N (t - \bar t)X_t}{\sum_{t=1}^N (t - \bar t)^2} = \frac{\sum_{t=1}^N (t - \bar t)(a' + b(t - \bar t) + \varepsilon_t)}{\sum_{t=1}^N (t - \bar t)^2} = \frac{a'\sum_{t=1}^N (t - \bar t) + b\sum_{t=1}^N (t - \bar t)^2 + \sum_{t=1}^N (t - \bar t)\varepsilon_t}{\sum_{t=1}^N (t - \bar t)^2} = b + \frac{\sum_{t=1}^N (t - \bar t)\varepsilon_t}{\sum_{t=1}^N (t - \bar t)^2}.$$
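Formulas (1.4) and (1.6) are a few lines of code. As a sanity check, on noiseless data $X_t = a + bt$ they recover the coefficients exactly (the synthetic values a = 2, b = 0.5 below are arbitrary).

```python
import numpy as np

def fit_linear_trend(X):
    """Least squares fit of X_t = a' + b(t - tbar) + eps_t, t = 1..N,
    using formulas (1.4) and (1.6)."""
    N = len(X)
    t = np.arange(1, N + 1)
    tbar = (N + 1) / 2
    a_hat = X.mean()                                            # (1.4)
    b_hat = ((t - tbar) * (X - a_hat)).sum() / ((t - tbar) ** 2).sum()  # (1.6)
    return a_hat, b_hat

# Noiseless check: X_t = 2 + 0.5 t, so a' = a + b*tbar = 2 + 0.5*15.5 = 9.75.
t = np.arange(1, 31)
X = 2.0 + 0.5 * t
a_hat, b_hat = fit_linear_trend(X)
```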
and then compare it with the corresponding percentiles of the t-distribution. Therefore,
if we would like a test with level of significance $\alpha$, we should reject $H_0$
if $|t| \ge t_{\alpha/2, N-2}$.
If the noise is not Gaussian, these statements become asymptotic: $Z$ is approximately
standard normal and $t$ is approximately t-distributed. Hence, the
stationarity test is no longer exact but approximate.
Prediction. The model is typically used for short term prediction. The
m-step ahead prediction constructed at time $N$ is given by the formula
$$(1.13)\quad \hat X_{N+m}(N) = \hat a' + \hat b(N + m - \bar t)$$
Hence, the prediction error is equal to
$$(1.14)\quad X_{N+m} - \hat X_{N+m}(N) = a' - \hat a' + (b - \hat b)(N + m - \bar t) + \varepsilon_{N+m}$$
Taking into account (1.9), we see that $X_{N+m} - \hat X_{N+m}(N)$ is (approximately) normal
with zero expectation and the variance
$$(1.15)\quad \sigma_\varepsilon^2\left(1 + \frac{1}{N} + \frac{(N + m - \bar t)^2}{\sum_{t=1}^N (t - \bar t)^2}\right)$$
(see Problem 2). Replacing $\sigma_\varepsilon^2$ by its estimate from (1.10), we can construct an
(approximate) $(1 - \alpha)$-confidence interval for $X_{N+m}$ (see details below):
$$(1.16)\quad \hat a' + \hat b(N + m - \bar t) \pm t_{\alpha/2,(N-2)}\,\hat\sigma_\varepsilon\sqrt{1 + \frac{1}{N} + \frac{(N + m - \bar t)^2}{\sum (t - \bar t)^2}}$$
where $t_{\alpha,k}$ is a percentile of the t-distribution with $k$ degrees of freedom. In
particular, if $m$ is comparable to $N$, then the width of the interval grows, indicating
that the forecast is less reliable.
Example. Let the data be given by the following table:
t 1 2 3 4 5 6 7 8
Xt 1.87 11.74 3.78 0.15 3.72 −9.79 12.13 −0.81
t 9 10 11 12 13 14 15 16
Xt 4.72 −19.48 −1.06 −16.96 15.9 1.28 4.85 −6.74
t 17 18 19 20 21 22 23 24
Xt −16.41 3.36 18.38 14.5 5.41 21.2 10.77 11.43
t 25 26 27 28 29 30
Xt −5.56 4.15 25.33 25.16 10.03 11.59
We have N = 30 and t̄ = (N + 1)/2 = 15.5. From (1.4) and (1.6), â′ = 4.75467 and
b̂ = 0.509597, so the estimated trend is given by the formula
4.75467 + 0.509597(t − 15.5) = −3.14409 + 0.509597t
Next, we compute Q(â′, b̂) = 3,234.5, which gives σ̂ε² = 115.518 and σ̂ε = 10.7479.
In order to test the hypothesis H0 : b = 0, we use formula (1.12). We get
t = 2.24 and the corresponding quantile of the t-distribution is 2.048, so we reject
the null hypothesis: the trend is statistically significant. In order to construct a
prediction, we just use formula (1.13). If, for instance, we want a prediction for
five steps ahead, we get
X̂31 (30) = 12.6534, X̂32 (30) = 13.163, X̂33 (30) = 13.6726,
X̂34 (30) = 14.1822, X̂35 (30) = 14.6918
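The computations of this example can be reproduced with a short script following (1.4), (1.6) and (1.13)–(1.16); the t-statistic for H0: b = 0 is computed here as b̂·√(Σ(t − t̄)²)/σ̂ε, which is consistent with the value t = 2.24 quoted above. The numbers are printed rather than hard-coded, since any digit of the table garbled in transcription would shift them slightly.

```python
import numpy as np

X = np.array([1.87, 11.74, 3.78, 0.15, 3.72, -9.79, 12.13, -0.81,
              4.72, -19.48, -1.06, -16.96, 15.9, 1.28, 4.85, -6.74,
              -16.41, 3.36, 18.38, 14.5, 5.41, 21.2, 10.77, 11.43,
              -5.56, 4.15, 25.33, 25.16, 10.03, 11.59])
N = len(X)
t = np.arange(1, N + 1)
tbar = (N + 1) / 2
S2 = ((t - tbar) ** 2).sum()

a_hat = X.mean()                                   # (1.4)
b_hat = ((t - tbar) * (X - a_hat)).sum() / S2      # (1.6)

Q = ((X - a_hat - b_hat * (t - tbar)) ** 2).sum()
sigma_hat = np.sqrt(Q / (N - 2))                   # noise estimate

t_stat = b_hat * np.sqrt(S2) / sigma_hat           # statistic for H0: b = 0

m = np.arange(1, 6)                                # five steps ahead
pred = a_hat + b_hat * (N + m - tbar)              # (1.13)
half = 2.048 * sigma_hat * np.sqrt(1 + 1 / N + (N + m - tbar) ** 2 / S2)  # (1.16)

print(a_hat, b_hat, t_stat)
print(pred)
print(pred + half)   # upper confidence bounds
print(pred - half)   # lower confidence bounds
```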
[Figure 1: the data and fitted trend with prediction and confidence bounds]
Confidence bounds for the prediction can be obtained from (1.16). Namely, the upper
bounds are equal to
36.1626, 36.8185, 37.4826, 38.1547, 38.8346
and the lower bounds are
−10.8557, −10.4925, −10.1374, −9.79025, −9.45098.
The data and fitted trend with prediction and confidence bounds are shown in
Figure 1.
Least Squares and Maximum Likelihood. In order to be able to speak
about likelihood, we need to make assumptions about the distribution of the noise
$\varepsilon_t$. Let us assume once again that $\varepsilon_t$ are independent normal random variables
with zero expectation and variance $\sigma_\varepsilon^2$. Then $X_t$ are also independent and normal
with expectation $a' + b(t - \bar t)$ and variance $\sigma_\varepsilon^2$. Therefore their joint density could
be found from the formula
$$f_{X_1 \dots X_N}(x_1, \dots, x_N) = \frac{1}{(2\pi)^{N/2}\sigma_\varepsilon^N}\exp\Big\{-\frac{1}{2\sigma_\varepsilon^2}\sum_{t=1}^N (x_t - a' - b(t - \bar t))^2\Big\}$$
Substituting the data $X_1, \dots, X_N$ instead of $x_1, \dots, x_N$, we get the likelihood function
$$L(a', b, \sigma_\varepsilon^2) = \frac{1}{(2\pi)^{N/2}\sigma_\varepsilon^N}\exp\Big\{-\frac{Q(a', b)}{2\sigma_\varepsilon^2}\Big\}.$$
Its logarithm therefore equals
$$\log L(a', b, \sigma_\varepsilon^2) = -\frac{N}{2}\log(2\pi) - \frac{N}{2}\log(\sigma_\varepsilon^2) - \frac{Q(a', b)}{2\sigma_\varepsilon^2}.$$
So, in order to maximize the log likelihood function, we have to minimize $Q(a', b)$,
that is, find the least squares estimates for $a'$ and $b$. After that, we have to choose
$\sigma_\varepsilon^2$ that maximizes
$$-\frac{N}{2}\log(\sigma_\varepsilon^2) - \frac{Q(\hat a', \hat b)}{2\sigma_\varepsilon^2}.$$
Setting the derivative to zero, we find the estimate
$$\hat\sigma_\varepsilon^2 = \frac{Q(\hat a', \hat b)}{N}$$
[Figure 2: data with the second-to-last value replaced by 100, and the fitted trend]
So, the maximum likelihood estimates for the coefficients $a'$ and $b$ coincide with the
least squares estimates. The estimate for $\sigma_\varepsilon^2$ differs from the one that we have
discussed earlier (we divide by N instead of N − 2); it is definitely biased. In a
sense, this is typical (maximum likelihood estimates for second moments are usually
biased).
Other parametric families. If the data set is long enough, we may use other
parametric curves, like polynomials (still linear in parameters, hence still a linear
regression). In some applications, we have to use models with asymptotic values.
In particular, it could be a logistic curve:
$$X_t = \frac{a}{1 + be^{-ct}} + \varepsilon_t$$
or the so-called Gompertz curve:
$$\log X_t = a + br^t + \varepsilon_t$$
where $0 < r < 1$. For both, the rate of convergence to the limiting level is exponential.
All models of those types are non-linear in parameters, so the estimation is no
longer an easy computational problem. However, if $\varepsilon_t$ are normal, then the least squares
estimates for all the parameters except the variance of the noise still coincide with the
maximum likelihood estimates.
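For instance, a logistic trend can be fitted by non-linear least squares. The sketch below uses scipy.optimize.curve_fit on noiseless synthetic data; all parameter values and the starting point p0 are hypothetical. A poor starting point may land in a local minimum, which is part of why the estimation is "no longer an easy computational problem."

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, a, b, c):
    # Logistic trend: approaches the asymptote a as t grows.
    return a / (1 + b * np.exp(-c * t))

# Hypothetical true parameters; noiseless data for a sanity check.
t = np.arange(0, 60, dtype=float)
y = logistic(t, 10.0, 5.0, 0.2)

# Non-linear least squares needs a starting point p0.
popt, _ = curve_fit(logistic, t, y, p0=(8.0, 3.0, 0.1))
print(popt)   # should be close to (10, 5, 0.2)
```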
Least squares and robustness. How sensitive is the method of least squares
to outliers and other errors? To illustrate that, consider the data set from
the example shown in Figure 1. The next two graphs (see Figures 2 and 3) show what
happens if we replace the second-to-last value by 100, and by 500 (this may look
harsh, but we are talking here about registration errors, for example a tenfold error
(misplaced decimal point), or a zero instead of the value). For the original data, the
estimated trend is −3.14409 + 0.509597t; for the first modification it is
−8.52161 + 1.05002t; and for the second modification it is −32.4297 + 3.45269t.
In Figure 4, you can compare the original data and the three trend lines.
One of the alternatives is the method of least absolute errors (LA), that
is, to minimize the sum of absolute values of the errors (instead of squares):
$$\sum_{t=1}^N |X_t - a - bt| \to \min$$
The optimal values of the parameters can be found by the methods of linear
programming. Such estimates are not the best in the Gaussian case, but they
[Figure 3: data with the second-to-last value replaced by 500, and the fitted trend]
Figure 4. The Data and three trend lines from the above graphs.
[Figure 5: the data with the least squares and least absolute errors trend lines]
are not sensitive to data errors. For instance, for the above data and for both
modifications, the estimated trend is 1.60379 + 0.266208t (you can compare it with
the least squares trend in Figure 5). If we make X29 negative, say −100 or −500,
then it becomes 0.110027 + 0.315999t. In fact, there are just two trend lines that
may appear if we change X29: if X29 is bigger than some critical value
(whatever it is), the trend is the less slanted line 1.60379 + 0.266208t; otherwise it is
0.110027 + 0.315999t. You can see those trend lines in Figure 6.
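The linear-programming formulation mentioned above can be sketched directly: introduce auxiliary variables u_t ≥ |X_t − a − bt| and minimize Σ u_t. The example below (synthetic data with one gross registration error, not the text's own data) illustrates the robustness: the LA fit recovers the true line exactly despite the outlier.

```python
import numpy as np
from scipy.optimize import linprog

def fit_lad_line(X):
    """Least absolute errors fit of X_t = a + b t via linear programming:
    minimize sum(u_t) subject to u_t >= |X_t - a - b t|."""
    N = len(X)
    t = np.arange(1, N + 1, dtype=float)
    # Variables: [a, b, u_1..u_N]; objective 0*a + 0*b + sum(u).
    c = np.concatenate([[0.0, 0.0], np.ones(N)])
    # u_t >= X_t - a - b t   ->  -a - b t - u_t <= -X_t
    # u_t >= -(X_t - a - b t) ->   a + b t - u_t <=  X_t
    A1 = np.hstack([-np.ones((N, 1)), -t[:, None], -np.eye(N)])
    A2 = np.hstack([np.ones((N, 1)), t[:, None], -np.eye(N)])
    A_ub = np.vstack([A1, A2])
    b_ub = np.concatenate([-X, X])
    bounds = [(None, None), (None, None)] + [(0, None)] * N
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[0], res.x[1]

# One gross outlier: the LA line still passes through the other 29 points.
t = np.arange(1, 31, dtype=float)
X = 2.0 + 0.5 * t
X[28] = 500.0                      # registration error in X_29
a_hat, b_hat = fit_lad_line(X)
```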
[Figure 6: the two possible least absolute errors trend lines]
Taking the expectation of both sides in (1.18) and using (1.9), we get
$$\mathsf{E}\, Q(\hat a', \hat b) = (N - 2)\sigma_\varepsilon^2$$
Let $e_t = X_t - \hat a' - \hat b(t - \bar t)$ be the residuals in the model. The residuals are still linear combinations of the
values of the noise, and therefore all $e_t$ also have a joint Gaussian distribution
together with $\hat a'$ and $\hat b$. However,
$$\operatorname{Cov}(e_t, \hat a') = \operatorname{Cov}(\varepsilon_t, \hat a') - \operatorname{Cov}(\hat a', \hat a') - (t - \bar t)\operatorname{Cov}(\hat b, \hat a') = \frac{1}{N}\sigma_\varepsilon^2 - \frac{1}{N}\sigma_\varepsilon^2 = 0$$
(we used formula (1.4) to compute $\operatorname{Cov}(\varepsilon_t, \hat a')$ and (1.9) for the variance of $\hat a'$ and the
covariance of $\hat a'$ and $\hat b$). For similar reasons (this time, use (1.9) and (1.6)), $\operatorname{Cov}(e_t, \hat b) = 0$.
Therefore each $e_t$ is independent from $\hat a'$ and $\hat b$, and so is $Q(\hat a', \hat b) = \sum e_t^2$. Next,
let's divide each side of (1.18) by $\sigma_\varepsilon^2$. We have
$$(1.19)\quad \sum_{t=1}^N \frac{\varepsilon_t^2}{\sigma_\varepsilon^2} = \frac{Q(\hat a', \hat b)}{\sigma_\varepsilon^2} + \frac{N(\hat a' - a')^2}{\sigma_\varepsilon^2} + \frac{(\hat b - b)^2\sum_{t=1}^N (t - \bar t)^2}{\sigma_\varepsilon^2}$$
It follows that $U = Q(\hat a', \hat b)/\sigma_\varepsilon^2$ has a $\chi^2$ distribution with
$N - 2$ degrees of freedom, and it is independent from $\hat a'$ and $\hat b$. It is also independent
from $\varepsilon_{N+m}$ whenever $m > 0$. Therefore $U$ is independent from $X_{N+m} - \hat X_{N+m}(N)$
because of (1.14). Next, the random variable $X_{N+m} - \hat X_{N+m}(N)$ is normal with
expectation zero and variance given by the formula (1.15). Therefore
$$(1.20)\quad t = \frac{X_{N+m} - \hat X_{N+m}(N)}{\sigma_\varepsilon\sqrt{1 + \frac{1}{N} + \frac{(N+m-\bar t)^2}{\sum (t - \bar t)^2}}}\Bigg/\sqrt{\frac{Q(\hat a', \hat b)}{\sigma_\varepsilon^2 (N - 2)}}$$
has a t distribution with $N - 2$ degrees of freedom. However, $\sigma_\varepsilon^2$ cancels out, and
$Q(\hat a', \hat b)/(N - 2) = \hat\sigma_\varepsilon^2$. For those reasons, the expression in (1.20) actually boils down to
$$(1.21)\quad t = \frac{X_{N+m} - \hat X_{N+m}(N)}{\hat\sigma_\varepsilon\sqrt{1 + \frac{1}{N} + \frac{(N+m-\bar t)^2}{\sum (t - \bar t)^2}}}$$
Now, $|t|$ does not exceed $t_{\alpha/2,(N-2)}$ with probability $1 - \alpha$. Solving for $X_{N+m}$ and
recalling that $\hat X_{N+m}(N) = \hat a' + \hat b(N + m - \bar t)$, we get (1.16). If the noise is not
Gaussian, the statement becomes an asymptotic one.
Exercises
1. Verify (1.9).
2. Using (1.9), show that the variance of the prediction error (1.14) is indeed
given by (1.15).
3. For a data set (will be posted on the web), fit a linear trend, test the
hypothesis H0 : b = 0 at the 5% level of significance and construct a forecast with
95% confidence intervals for the next five values of the series.
2. Moving Averages
Suppose now that the series is long (and non-stationary). As a rule, there is no
parametric curve that approximates the whole data set. There exist several methods
of forecasting non-stationary series, which will be discussed later. However,
sometimes we just need to decompose the series into a sum of a trend and a random
component. One of the ways to proceed is called the method of Moving Averages.
It is based on a local approximation by a polynomial trend. In order to estimate
the trend f (t) at time t, we fit a polynomial trend to the data on the time interval
[t − l, t + l]. The value of the fitted trend at time t is the estimate for f (t). The
degree of the polynomial and the window width l are the parameters of the method.
It looks like we have to re-do the computations completely as we move from t
to t + 1, but the reality is not so bad. To figure out what is going on, as well as why
it is a reasonable thing to do, let us suppose that the data can be represented as
Xt = f (t) + εt
where εt (noise) is i.i.d. and f (t) (trend) is “smooth”, that is, it can be locally
approximated by a polynomial.
To begin with, suppose that f (t) is locally linear—any portion of the data of
given length 2l + 1 can be treated as a series with linear trend at + b, however, the
coefficients a and b slowly change in time.
In order to compute the coefficients of the trend on the interval $[t - l, t + l]$, we
have to minimize the sum
$$\sum_{s=t-l}^{t+l} (X_s - a - bs)^2 = \sum_{i=-l}^{l} (X_{t+i} - \tilde a - bi)^2$$
where $\tilde a = a + bt$. Note that $\tilde a$ is exactly the value of the trend at $t$. Differentiating
with respect to $\tilde a$, we get the equation
$$\sum_{i=-l}^{l} (X_{t+i} - \tilde a - bi) = 0$$
which implies
$$\sum_{i=-l}^{l} X_{t+i} = (2l + 1)\tilde a + b\sum_{i=-l}^{l} i = (2l + 1)\tilde a$$
Therefore the estimate for $\tilde a$, and therefore the estimate $\hat f(t)$ for the value of the
trend at time $t$, could be found from the formula
$$(2.1)\quad \tilde a = \hat f(t) = \frac{1}{2l + 1}\sum_{i=-l}^{l} X_{t+i}$$
So, fˆ(t) is actually just the average of the values of the series over the time interval
[t − l, t + l] (the moving average, or the moving average of the 1st order, since we
are using an approximation by a linear function). Note that we don’t actually need
the value of the coefficient b.
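Formula (2.1) is a plain convolution with equal weights. A minimal sketch (synthetic data): on an exactly linear series, the average over a symmetric window returns the trend value at the window center, mirroring the derivation above.

```python
import numpy as np

def moving_average(X, l):
    """First-order moving average (2.1): simple average over [t-l, t+l].
    End effects: only t = l .. len(X)-l-1 are returned."""
    window = np.ones(2 * l + 1) / (2 * l + 1)
    return np.convolve(X, window, mode="valid")

# On an exactly linear series the moving average reproduces the trend:
t = np.arange(100, dtype=float)
X = 3.0 + 0.25 * t
f_hat = moving_average(X, l=10)     # estimates for t = 10 .. 89
```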
If we increase the degree of the polynomial, we arrive at weighted moving averages.
For instance, let us assume that the function $f(t)$ can be locally approximated
by a parabola. As above, we have to minimize the sum
$$\sum_{s=t-l}^{t+l} (X_s - a - bs - cs^2)^2 = \sum_{i=-l}^{l} (X_{t+i} - \tilde a - \tilde b i - ci^2)^2$$
where $\tilde a = a + bt + ct^2$ and $\tilde b = b + 2ct$. Once again, $\tilde a$ is the value of the trend at
time $t$, and therefore an estimate for $\tilde a$ is an estimate for $f(t)$.
Differentiating with respect to $\tilde a$ and $c$, we get two equations:
$$\sum_{i=-l}^{l} X_{t+i} = \tilde a(2l + 1) + \tilde b\sum_{i=-l}^{l} i + c\sum_{i=-l}^{l} i^2 = \tilde a(2l + 1) + c\sum_{i=-l}^{l} i^2,$$
$$\sum_{i=-l}^{l} i^2 X_{t+i} = \tilde a\sum_{i=-l}^{l} i^2 + \tilde b\sum_{i=-l}^{l} i^3 + c\sum_{i=-l}^{l} i^4 = \tilde a\sum_{i=-l}^{l} i^2 + c\sum_{i=-l}^{l} i^4,$$
since $\sum_{i=-l}^{l} i = \sum_{i=-l}^{l} i^3 = 0$. Let us denote
$$I_2 = \sum_{i=-l}^{l} i^2 = \frac{l(l + 1)(2l + 1)}{3}$$
$$I_4 = \sum_{i=-l}^{l} i^4 = \frac{l(l + 1)(2l + 1)(3l^2 + 3l - 1)}{15}$$
$$S_t = \sum_{i=-l}^{l} X_{t+i}$$
$$Z_t = \sum_{i=-l}^{l} i^2 X_{t+i}$$
Solving for $c$, we get
$$c = \frac{Z_t}{I_4} - \tilde a\frac{I_2}{I_4}$$
which implies
$$\tilde a = \hat f(t) = \frac{S_t - I_2 Z_t/I_4}{2l + 1 - I_2^2/I_4} = \frac{1}{2l + 1 - I_2^2/I_4}\sum_{i=-l}^{l}\Big(1 - \frac{I_2}{I_4}i^2\Big)X_{t+i}.$$
Therefore the estimate $\hat f(t)$ is equal to the weighted moving average, or moving
average of the 2nd order,
$$(2.2)\quad \hat f(t) = \sum_{i=-l}^{l} a_i X_{t+i}$$
with weights
$$(2.3)\quad a_i = \frac{1 - i^2 I_2/I_4}{2l + 1 - I_2^2/I_4}, \qquad i = -l, \dots, l.$$
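The weights (2.3) can be checked numerically: by construction they sum to one and annihilate the linear and quadratic terms, so the weighted average reproduces any quadratic trend exactly at the window center. A sketch with l = 5 (the particular quadratic below is arbitrary):

```python
import numpy as np

def weights_second_order(l):
    """Weights (2.3) of the second-order weighted moving average."""
    i = np.arange(-l, l + 1)
    I2 = l * (l + 1) * (2 * l + 1) / 3
    I4 = l * (l + 1) * (2 * l + 1) * (3 * l**2 + 3 * l - 1) / 15
    return (1 - i**2 * I2 / I4) / (2 * l + 1 - I2**2 / I4)

a = weights_second_order(5)
i = np.arange(-5, 6)

# An arbitrary quadratic f(t+i) with center value f(t) = 7:
quadratic = 7.0 - 2.0 * i + 0.3 * i**2
f_hat = (a * quadratic).sum()        # reproduces f(t) = 7 exactly
```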
Yet another interpretation. Let again $X_t = f(t) + \varepsilon_t$ where $\varepsilon_t$ is i.i.d. and
$f(t)$ is locally linear, and let
$$Y_t = \frac{1}{2l + 1}\sum_{k=-l}^{l} X_{t+k} = \bar f(t) + \bar\varepsilon_t$$
where $\bar f(t)$ is the average of $f(t)$ over the interval $[t - l, t + l]$ and $\bar\varepsilon_t$ is a similar
average of the noise. If the function $f$ is (nearly) linear, then $\bar f(t) \approx f(t)$, and
$$\operatorname{Var}(\bar\varepsilon_t) = \frac{1}{2l + 1}\operatorname{Var}(\varepsilon_t)$$
has been significantly reduced.
Suppose now that $f(t)$ is locally quadratic and we are using the corresponding
weighted moving averages. Once again,
$$Y_t = \sum_{k=-l}^{l} a_k f(t + k) + \sum_{k=-l}^{l} a_k\varepsilon_{t+k} = \bar f(t) + \bar\varepsilon_t$$
where $\bar f(t)$ and $\bar\varepsilon_t$ are the weighted moving averages. However, the weights are
designed in such a way that $\bar f(t) \approx f(t)$ (if $f$ is precisely quadratic, then $\bar f = f$, see
Problem 4), and the variance of the new noise is equal to
$$\operatorname{Var}(\bar\varepsilon_t) = \sum a_i^2\operatorname{Var}(\varepsilon_t)$$
which is also substantially smaller than $\operatorname{Var}(\varepsilon_t)$. However, though the original noise
was i.i.d., the new noise is still identically distributed but no longer independent.
[Figure 7: logarithm of the crops price data, Years 1600–1800]
Comments. 1. Role of the Window Width. How does the result depend
on the window width? The bigger $l$ is, the less weight is given to any specific
observation, and the smoother the estimated trend becomes. On the other hand, if
we increase the degree of the polynomial, the estimated trend becomes less smooth,
more sensitive to small-scale fluctuations. For the following data set (logarithm
of the crops price data, about 350 points, see Figure 7), compare the first order
moving average with l = 10 (Figure 8), the first order moving average with l = 25
(Figure 9), and the second order moving average with l = 25 (Figure 10). Note how the
relatively big values around the year 1630 are noticed by the first and the third
graphs, and practically ignored by the second one.
2. Rules of thumb. First of all, polynomials of degrees higher than three
do not pay off. Next, the width of the window (2l + 1) should be at least 10 times
bigger than the number of parameters used (say, l should be at least 10 for a linear
trend, and at least 20 for quadratic and cubic). Also, the polynomial approximation
should look plausible within a window of that size (we have to examine the graph
for that). For instance, for the data shown in Figure 11, no approximation around
the point t = 29 looks possible. Other than that, a linear approximation is
unlikely to suffice, while parabolic or cubic is probably okay for l = 20.
3. Drawbacks. The method requires l points before and after the point t.
Hence, for a data set X1, . . . , XN, the estimate fˆ(t) can only be computed for time
instances between l + 1 and N − l (so-called end effects). In order to handle time
instances that are close to the beginning and the end of the data set, one could fit
[Figures: moving averages of the log crops price data, Years 1600–1800]
Figure 11. Coal production data. The series changes its behavior
around the point t = 29; local approximation by a polynomial is not
plausible around that point.
a trend line to the available section of the data (say, in order to estimate f (2), we
could fit a trend to the points X1, . . . , X2+l; however, this is no longer a weighted
average with the same weight coefficients). Also, the method can hardly be used
for prediction (even if we fit a trend to the last portion of the data, we are using
only the last l + 1 data points and ignoring the rest of the data).
Robustness of the method and 53X smoothing. Like everything based on
least squares, the method is extremely sensitive to data errors (outliers). To
illustrate that, we have replaced one of the values in the above example (Figure 7)
Figure 12. Data with outlier: Moving Average of the first order,
l = 10.
Figure 13. Data with outlier: Moving Average of the 3rd order,
l = 25.
by 15. The results are shown in Figures 12 and 13 (compare them with Figures 8 and
10, respectively).
An alternative is to combine the method with other procedures that eliminate or
reduce outliers. One of the possibilities is the so-called 53X smoothing (suggested
by Tukey). It is a heuristic procedure with no probabilistic interpretation or
motivation. However, it practically eliminates irregularities such as outliers. It is
a three-stage procedure.
First step: We define Yt as the median of the 5 values Xt−2, Xt−1, Xt, Xt+1, Xt+2
(that is, we re-arrange them in increasing order and take the middle point).
Second step: Zt is the median of Yt−1, Yt, Yt+1.
Third step: Finally, Wt = .25Zt−1 + .5Zt + .25Zt+1.
The following example shows how it works:
Xt  55   54   56   52   76   113  68   59   74   64
Yt  ...  ...  55   56   68   68   74   68   ...  ...
Zt  ...  ...  ...  56   68   68   68   ...  ...  ...
Wt  ...  ...  ...  ...  65   68   ...  ...  ...  ...
End effects are not significant here (we are losing only four points at each end;
there exists a modification of the algorithm that allows us to handle those as well).
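The three steps are easy to code. The sketch below reproduces the worked example above (W5 = 65, W6 = 68); as noted, four points are lost at each end.

```python
import numpy as np

def smooth_53x(X):
    """Tukey's 53X smoothing: median of 5, then median of 3,
    then the weights (.25, .5, .25). Four points are lost at each end."""
    X = np.asarray(X, dtype=float)
    Y = np.array([np.median(X[i - 2:i + 3]) for i in range(2, len(X) - 2)])
    Z = np.array([np.median(Y[i - 1:i + 2]) for i in range(1, len(Y) - 1)])
    return 0.25 * Z[:-2] + 0.5 * Z[1:-1] + 0.25 * Z[2:]

# The worked example from the text:
X = [55, 54, 56, 52, 76, 113, 68, 59, 74, 64]
W = smooth_53x(X)        # W_5 = 65, W_6 = 68
```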
The 53X procedure does not significantly change the data just by itself (see
Figure 14). However, if we apply 53X smoothing before the (weighted) moving
Figure 14. Original series (log of the price index) and the 53X smoothing.
[Figure 15: 53X smoothing followed by the weighted moving average, Years 1690–1760]
averages, it practically eliminates the effect of outliers (see Figure 15; compare it
with Figure 12).
Exercises
1. Find the formula for the weights for a cubic polynomial and compare them
with the weights given by the formula (2.3).
2. Find the formula for the weights for a 4th degree polynomial. Give the
answer in terms of $I_2$, $I_4$,
$$I_6 = \sum_{i=-l}^{l} i^6 = \frac{l(l + 1)(2l + 1)(3l^4 + 6l^3 - 3l + 1)}{21}$$
and
$$I_8 = \sum_{i=-l}^{l} i^8 = \frac{l(l + 1)(2l + 1)(5l^6 + 15l^5 + 5l^4 - 15l^3 - l^2 + 9l - 3)}{45}.$$
5.(a) Apply moving average (2.1) with l = 15 to a data set (will be posted
on the web) and graph the results. (b) For the same data, first, apply the 53X
procedure, and then apply the moving average to the result. Graph the results and
compare them with the previous ones.
6.(a) Apply the weighted moving average (2.2), (2.3) with l = 25 to a data
set (will be posted on the web) and graph the results. (b) For the same data,
first, apply the 53X procedure, and then apply the weighted moving average to the
result. Graph the results and compare them with the previous ones.
3. Exponential Smoothing
Exponential smoothing is a method of prediction based on a local approxi-
mation of a series by a polynomial (or, sometimes, by another parametric curve).
Though it is called “smoothing”, it has nothing to do with smoothing at all. In the
simplest form, it could be described as follows.
Suppose we have a long series and the observations have the structure Xt =
a(t) + εt, where εt is white noise, that is, i.i.d. random variables with zero
expectation, and a(t) changes very slowly in time. So, the most reasonable short
term prediction at time n should be a constant, namely the last value of the trend
a(n). In order to estimate a(n), we could take an average of the recent observations,
but then we have to decide how many of them to include. Also, when another
observation becomes available, we have to do all the computations again.
Instead of that, we fit a model Xt = a + εt to the whole series, giving more
weight to recent observations. Namely, we minimize the sum
(3.1)    Σ_{k=0}^{∞} β^k (Xn−k − a)^2
where 0 < β < 1 is a discount coefficient (to outline the idea and to simplify the
formulae, we assume that the observations begin at −∞). Differentiating with
respect to a, we get
Σ_{k=0}^{∞} β^k (Xn−k − a) = 0
and therefore
(3.2)    â(n) = Σ_{k=0}^{∞} β^k Xn−k / Σ_{k=0}^{∞} β^k = Σ_{k=0}^{∞} (1 − β) β^k Xn−k
Hence Xn, the last available observation, receives a relative weight (1 − β).
Denote by X̂n+1 (n) a one step ahead forecast made at time n. In our case, it
is equal to â(n) and therefore it is given by the formula (3.2). We therefore have
(3.3)    X̂n+1(n) = Σ_{k=0}^{∞} (1 − β) β^k Xn−k
                 = (1 − β)Xn + β Σ_{k=0}^{∞} (1 − β) β^k Xn−1−k
                 = (1 − β)Xn + β X̂n(n − 1)
                 = X̂n(n − 1) + (1 − β)(Xn − X̂n(n − 1)).
Denote by â(n) and b̂(n) the solution to (3.5). Let X̂n+1 (n) = â(n) + b̂(n) be the
one step ahead prediction and let en+1 be the one step ahead prediction error
en+1 = Xn+1 − â(n) − b̂(n).
It could be shown (see the sketch below) that the estimates â(n + 1) and b̂(n + 1)
could be obtained from the previous estimates â(n) and b̂(n) and from the prediction
Once again, let X̂n+1 (n) = â(n) + b̂(n) + ĉ(n) be the one step ahead prediction and
let en+1 be the one step ahead prediction error
Then the estimates â(n + 1), b̂(n + 1) and ĉ(n + 1) are related to the previous
estimates â(n), b̂(n) and ĉ(n) and the prediction error en+1 by the equations
and so on. After that, we can compute the standard deviations and get the bounds
X̂n+k(n) ± 2σ̂k. The bounds are purely empirical: there is no probability associated
with them, they only show how good the prediction was in the past. Nonetheless, they
might give us an idea of how the method works for the given data. In particular,
if the prediction deviates from the actual data by more than two standard
deviations several times in a row, it might mean that the generating mechanism has
changed and we should forget about the old data and start anew.
Practical implementation. A time series never starts at −∞. So, we use
the first observation to set up initial value(s) for the parameter(s). For instance,
in case of zero order exponential smoothing, we set â(1) = X1 . If we use the first
order exponential smoothing, then we set â(1) = X1 and b̂(1) = 0. In case of second
order exponential smoothing, we set â(1) = X1 and b̂(1) = ĉ(1) = 0. After that,
we use the corresponding renewal equations (3.4) or (3.6) or (3.8). The following
numerical example shows how it works.
Let the first five values of the series be X1 = 3, X2 = 7, X3 = 6, X4 = 5, X5 = 4,
and let α = 0.1. For the zero order exponential smoothing, we set â(1) = X1 = 3.
Then our prediction for X2 equals 3, and prediction error equals e2 = X2 −â(1) = 4.
Therefore â(2) = â(1) + αe2 = 3.4, and that is our prediction for X3 . Hence, the
prediction error e3 = X3 − â(2) = 2.6 and â(3) = â(2) + αe3 = 3.66. So, our
prediction for X4 is 3.66, and the prediction error e4 = X4 − â(3) = 1.34. Next,
â(4) = â(3) + αe4 = 3.794, and that is our prediction for X5 . So, the prediction
error e5 = X5 − â(4) = 0.206. Finally, â(5) = â(4) + αe5 = 3.8146, and that is our
prediction for X6 (and for all subsequent time instances as well).
Time Xi Prediction Prediction error â(i)
1 3 ... ... 3
2 7 3 4 3.4
3 6 3.4 2.6 3.66
4 5 3.66 1.34 3.794
5 4 3.794 0.206 3.8146
6 ... 3.8146 ... ...
Zero Order Exponential Smoothing
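The renewal recursion â(n + 1) = â(n) + α e_{n+1} is a few lines of code; a sketch that, run on the example above, reproduces the â(i) column of the table (using the initialization â(1) = X1 from the text):

```python
def zero_order_es(x, alpha):
    # Zero order exponential smoothing: a_hat(1) = X1, then
    # a_hat(n+1) = a_hat(n) + alpha * e_{n+1}, where e_{n+1} is
    # the one step ahead prediction error.
    a = x[0]
    history = [a]
    for xn in x[1:]:
        e = xn - a          # the prediction for xn was a
        a = a + alpha * e   # renewal equation
        history.append(a)
    return history

zero_order_es([3, 7, 6, 5, 4], 0.1)  # the a_hat column: 3, 3.4, 3.66, 3.794, 3.8146
```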
For first order exponential smoothing, we set â(1) = X1 = 3 and b̂(1) = 0. So,
our prediction for X2 still equals 3 and the prediction error e2 still equals 4. For
that reason, â(2) = â(1) + b̂(1) + (2α − α^2)e2 = 3.76 and b̂(2) = b̂(1) + α^2 e2 = 0.04.
Therefore our prediction for X3 equals â(2) + b̂(2) = 3.8 and the prediction error e3
turns out to be equal to 2.2. Once again, â(3) = â(2) + b̂(2) + (2α − α^2)e3 = 4.218
and b̂(3) = b̂(2) + α^2 e3 = 0.062. So, our prediction for X4 equals â(3) + b̂(3) = 4.28
and the prediction error e4 turns out to be 0.72. Hence â(4) = â(3) + b̂(3) + (2α −
α^2)e4 = 4.4168 and b̂(4) = b̂(3) + α^2 e4 = 0.0692, our prediction for X5 equals
â(4) + b̂(4) = 4.486 and the prediction error e5 = −0.486. Using the iteration
formulas (3.6) for the last time, we find â(5) = â(4) + b̂(4) + (2α − α^2)e5 = 4.39366
and b̂(5) = b̂(4) + α^2 e5 = 0.06434. Our prediction for X6 equals â(5) + b̂(5) =
4.458. If, at this time, we need an i steps ahead prediction, that is, a prediction for
X5+i, then it equals â(5) + b̂(5)i. So, the prediction for X7 equals 4.52234, the
prediction for X8 equals 4.58668, and so on.
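The same computation in code, using the first order renewal equations exactly as in the example (the formulas referred to as (3.6) in the text):

```python
def first_order_es(x, alpha):
    # First order exponential smoothing: a_hat(1) = X1, b_hat(1) = 0;
    # with e = X_{n+1} - a_hat(n) - b_hat(n),
    #   a_hat(n+1) = a_hat(n) + b_hat(n) + (2*alpha - alpha**2) * e
    #   b_hat(n+1) = b_hat(n) + alpha**2 * e
    a, b = x[0], 0.0
    for xn in x[1:]:
        e = xn - a - b
        a, b = a + b + (2 * alpha - alpha ** 2) * e, b + alpha ** 2 * e
    return a, b

a, b = first_order_es([3, 7, 6, 5, 4], 0.1)
# a ≈ 4.39366, b ≈ 0.06434; the i steps ahead prediction is a + b * i
```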
Figure 17. The data, one step ahead predictions and five steps
ahead prediction made at t = 96, according to the zero order ex-
ponential smoothing for the concentration data, α = 0.1.
Figure 18. The data, one step ahead predictions and five steps
ahead prediction made at t = 96, according to the first order ex-
ponential smoothing for the concentration data, α = 0.1.
can see the original data, the series made out of one step ahead predictions, and a
five steps ahead prediction made at time instant t = 96.
Role of the smoothing parameter α. The bigger α is, the smaller β is,
the more weight is given to recent observations, and the more sensitive the
forecast is to the noise. In fact, α = (1 − β) is precisely the relative weight
given to the very last observation. The one before it gets the weight αβ, and so on. So if α is close
Figure 19. The data, one step ahead predictions and five steps
ahead prediction made at t = 96, according to the second order
exponential smoothing for the concentration data, α = 0.1.
Figure 20. The data, one step ahead predictions and five steps
ahead prediction made at t = 96, according to the second order
exponential smoothing for the concentration data, α = 0.05.
to one, we just ignore everything but the last observation, and our one step ahead
prediction is just equal to the previous value (not a good idea at all). In practice,
you should never use values of α bigger than 0.2 (otherwise you are giving
the last observation too much weight).
On the other hand, if α is very small, the coefficients adjust way too slowly. If,
in addition, the degree of the polynomial is not adequate (for instance, the data
contains a trend but we are using the zero order exponential smoothing) or if the
initial values for the coefficients are chosen badly, the model leads to large
prediction errors and the prediction looks way off the mark.
To illustrate that, we apply the second order exponential smoothing to the
concentration data (same as in Figure 19) with α = 0.05 (see Figure 20) and
α = 0.15 (Figure 21). As you can see, the bigger α is, the stronger the reaction to
any random oscillations.
Choice of α. The following procedure is purely heuristic. Compute the average
of the squared one step ahead prediction error over the last half of the data (or over
the last two-thirds of the data, or over any other fixed portion):
Q(α) = Σ_t e_t^2
Figure 21. The data, one step ahead predictions and five steps
ahead prediction made at t = 96, according to the second order
exponential smoothing for the concentration data, α = 0.15.
Choose the smoothing parameter α that minimizes Q(α). If the optimal α turns
out to be more than 0.2, the method is not working properly (you are giving too
much weight to just a few recent observations). You should either increase the
degree of the polynomial (say, consider first order smoothing instead of zero order
smoothing) and re-evaluate the best α or just look for other prediction methods.
Applying this idea to the concentration data, we get α = 0.092 as the best
value for the first order model (see Figure 22). For the second order model, the
best value is α = 0.072 (see Figure 23).
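The search itself is a short grid minimization. A sketch (the concentration data is not reproduced here, so the demonstration runs on simulated trend-plus-noise data; the grid step and the "last half" window are the choices described above):

```python
import numpy as np

def one_step_errors(x, alpha):
    # One step ahead errors of first order exponential smoothing.
    a, b = x[0], 0.0
    errors = []
    for xn in x[1:]:
        e = xn - a - b
        errors.append(e)
        a, b = a + b + (2 * alpha - alpha ** 2) * e, b + alpha ** 2 * e
    return np.array(errors)

def best_alpha(x, grid=np.arange(0.01, 0.201, 0.001)):
    # Q(alpha): sum of squared errors over the last half of the data.
    half = len(x) // 2
    q = [np.sum(one_step_errors(x, a)[half:] ** 2) for a in grid]
    return float(grid[int(np.argmin(q))])

rng = np.random.default_rng(0)
x = 0.05 * np.arange(300) + rng.normal(0, 0.5, 300)  # linear trend + noise
alpha_star = best_alpha(x)   # the heuristic choice of alpha
```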
When does it fail to work? Exponential smoothing may fail to work if the
series suddenly changes its behavior, especially if it has sudden jumps. An example
can be seen in Figure 24. This data set represents the number of directory assistance
calls (monthly, 1962 - 1976). As you can see on the graph, in ten months (May 1973
to March 1974) the number of calls drops by about 80 percent, most likely due
to some business decisions. Before and after this period of time, we can see a linear
trend, so exponential smoothing of order 1 might work. Indeed, as you can see in
Figure 25, exponential smoothing of order 1 with α = 0.1 looks adequate before the
drop. However, predictions made after the drop begins are, naturally, absolutely
off the mark. It takes more than 36 points (three years) for the prediction to start
making sense again. Even the very last prediction made in July 1976, five points
before the end of the data set, does not look right (it still predicts a negative trend).
The only reasonable solution here would be to start everything anew just after the
drop.
There exist heuristic algorithms that allow you to detect such a situation
automatically. Basically, you can use any past section of the data to estimate the
standard deviation of a one step ahead prediction error. If, several times in a
row, your prediction errors are all of the same sign (say, all positive) and bigger
The first one of them is simply a geometric series. The others may be obtained
from the first one by term-by-term differentiation.
We begin with the first order exponential smoothing. The equations (3.5) can
be re-written as
(3.9)    S0(n) − G0 a + G1 b = 0
         S1(n) − G1 a + G2 b = 0
(3.11) ΛAn = Sn
where
Λ = [ G0  −G1 ],     An = [ â(n) ],     Sn = [ S0(n) ]
    [ G1  −G2 ]           [ b̂(n) ]          [ S1(n) ]
Now, note that
(3.12)    S0(n) = Xn + (1 − α) Σ_{k=1}^{∞} (1 − α)^{k−1} Xn−k
                = Xn + (1 − α) Σ_{k=1}^{∞} (1 − α)^{k−1} X(n−1)−(k−1)
                = Xn + (1 − α) S0(n − 1)
In a similar way,
(3.13)    S1(n) = (1 − α) Σ_{k=1}^{∞} k(1 − α)^{k−1} Xn−k
                = (1 − α) Σ_{k=1}^{∞} (1 + (k − 1))(1 − α)^{k−1} X(n−1)−(k−1)
                = (1 − α)(S0(n − 1) + S1(n − 1))
Finally, note that
(3.14) Xn = X̂n (n − 1) + en = â(n − 1) + b̂(n − 1) + en .
Plugging (3.14) into (3.12), we get
(3.15) S0 (n) = (1 − α)S0 (n − 1) + â(n − 1) + b̂(n − 1) + en .
Once again, re-write (3.15) and (3.13) in matrix form. We get
(3.16) Sn = BSn−1 + ΓAn−1 + En
where
B = [ (1 − α)     0     ],     Γ = [ 1  1 ],     En = [ en ]
    [ (1 − α)  (1 − α)  ]          [ 0  0 ]           [ 0  ]
Substituting (3.11) for Sn and Sn−1 in both sides of (3.16) and solving for An , we
get the following relation between An and An−1 :
(3.17)    An = Λ^{−1}(BΛ + Γ) An−1 + Λ^{−1} En
Now, direct computation shows that
(3.18)    Λ^{−1} = [ 2α − α^2    −α^2          ]
                   [ α^2         −α^3/(1 − α)  ],
which implies
(3.19)    Λ^{−1}(BΛ + Γ) = [ 1  1 ]
                           [ 0  1 ]
and
(3.20)    Λ^{−1} En = [ 2α − α^2 ] en,
                      [ α^2      ]
which makes (3.17) equivalent to (3.6).
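The identities (3.18)–(3.20) are easy to verify numerically; a quick check with NumPy, using the closed forms G0 = 1/α, G1 = (1 − α)/α^2, G2 = (1 − α)(2 − α)/α^3 for the geometric sums:

```python
import numpy as np

alpha = 0.1
beta = 1 - alpha
G0 = 1 / alpha                        # sum of beta^k
G1 = beta / alpha ** 2                # sum of k * beta^k
G2 = beta * (1 + beta) / alpha ** 3   # sum of k^2 * beta^k

Lam = np.array([[G0, -G1], [G1, -G2]])
B = np.array([[beta, 0.0], [beta, beta]])
Gamma = np.array([[1.0, 1.0], [0.0, 0.0]])

Lam_inv = np.linalg.inv(Lam)
# (3.18): the inverse has the stated closed form
assert np.allclose(Lam_inv, [[2 * alpha - alpha ** 2, -alpha ** 2],
                             [alpha ** 2, -alpha ** 3 / beta]])
# (3.19): Lam^{-1} (B Lam + Gamma) is the unit upper triangular matrix
assert np.allclose(Lam_inv @ (B @ Lam + Gamma), [[1, 1], [0, 1]])
```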
In case of second order exponential smoothing, we also need

S2(n) = Σ_{k=0}^{∞} k^2 (1 − α)^k Xn−k

and

G3 = Σ_{k=0}^{∞} k^3 (1 − α)^k = (6 − 12α + 7α^2 − α^3)/α^4,

G4 = Σ_{k=0}^{∞} k^4 (1 − α)^k = (24 − 60α + 50α^2 − 15α^3 + α^4)/α^5.
Exercises
1. Verify (3.12), (3.13), (3.15), (3.16) and (3.17).
2. (a) Verify (3.18), (3.19) and (3.20). (b) Show that (3.17) - (3.20) imply
(3.6).
3. Verify that (3.8) really describes the second order exponential smoothing.
To this end, verify the corresponding modification of (3.11), (3.16) and (3.17), along
with (3.25), (3.26) and (3.27). Finally, show that (3.17) together with (3.26) and
(3.27) implies (3.8).
4. For the data (will be posted on the web), compute the zero order, first order
and second order exponential smoothing with α = 0.05. Graph the results. Which
one looks more reasonable to you?
5. Do the same with α = 0.1.
4. Stationarity Tests
Does the series contain a trend? There exist a number of tests that allow us
to find out whether the series possibly contains a trend. We discuss two of them.
Kendall rank correlation test. Let X1 , . . . , Xn be the data. Assume for a
moment that all the values Xi are different. Denote by K the number of pairs i, j
such that i < j and Xi < Xj . The quantity
T = 4K/(n(n − 1)) − 1
is called the Kendall rank correlation. It could be easily seen that −1 ≤ T ≤
1. Also, if the series is monotone increasing, then T = 1, and if it is monotone
decreasing, then T = −1. If the data are independent and identically distributed,
then T is approximately normal with expectation 0 and the variance
σ_T^2 = 2(2n + 5)/(9n(n − 1)).
The statistic T has been designed to be sensitive to the presence of a trend, and
can be used to test stationarity. For instance, if |T| > 1.96σT, then
the series possibly contains a trend (the level of significance of the test is about 5%).
If not all of the values Xi are different, we should define K as the number of
pairs i, j such that i < j and Xi < Xj plus one half of the number of pairs i, j
such that i < j and Xi = Xj .
Spearman rank correlation test. Let X1 , . . . , Xn be the data. Once again,
assume that all the values Xi are different. For each time instant i, let ri be the
rank of Xi, that is, 1 plus the number of those j ≠ i such that Xj < Xi (so if Xi is
the smallest of X1 , . . . , Xn , then its rank ri is equal to one; if it is the biggest, then
the rank is equal to n). The Spearman rank correlation is defined by the formula
S = 1 − 6 Σ_{i=1}^{n} (ri − i)^2 / (n(n^2 − 1))
Once again, if the series is monotone increasing, then S = 1. If it is monotone
decreasing, then S = −1. It could be shown that, if the data are independent and
identically distributed, then S is approximately normal with expectation 0 and the
variance
σ_S^2 = 1/n.
If not all of the values Xi are different, we should define ri as 1 plus the
number of j ≠ i such that Xj < Xi plus one half of the number of j ≠ i such that
Xj = Xi. In other words, we rearrange the data in increasing order. If a value
is not repeated, then its rank is exactly its position in the new list. If the value
is repeated several times, then each of them gets the average of the corresponding
ranks. So, for instance, if Xi is the smallest value and it is taken exactly twice,
then each of those observations gets the rank 1.5. If it is repeated six times, then
each of the values gets the rank 3.5.
Both statistics have a similar meaning, and they work similarly. The powers of the
corresponding tests are approximately equal.
Example. The following example shows how it works. Suppose the values of
the series are equal to
−4, 21, 14, 10, 22, −11, 14, 18, 14, 21, 19, 20
(n = 12). For the Kendall coefficient, we have here forty pairs such that i < j
and Xi < Xj, and four pairs such that i < j and Xi = Xj (the value 21 is taken
twice and the value 14 is taken three times). That gives us K = 42 and T = 0.273.
However, T/σT = 1.235, which is much less than Z0.025 = 1.96, so the Kendall test
turns out negative at the 5% level of significance.
For the Spearman coefficient, we have to find the ranks. They are equal to
2, 10.5, 5, 3, 12, 1, 5, 7, 5, 10.5, 8, 9
(note that some of the ranks are repeated). That gives us S = 0.33 and S/σS =
1.145, again insignificant at the 5% level of significance.
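Both coefficients, with the tie conventions described above, are direct to compute; a sketch that reproduces the example:

```python
def kendall_T(x):
    # K: pairs i < j with X_i < X_j, plus half of the tied pairs.
    n = len(x)
    K = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            if x[i] < x[j]:
                K += 1.0
            elif x[i] == x[j]:
                K += 0.5
    return 4 * K / (n * (n - 1)) - 1

def spearman_S(x):
    # Midranks: 1 + #{j: X_j < X_i} + (1/2) #{j != i: X_j = X_i}.
    n = len(x)
    r = [1 + sum(xj < xi for xj in x) + 0.5 * (sum(xj == xi for xj in x) - 1)
         for xi in x]
    d2 = sum((ri - (i + 1)) ** 2 for i, ri in enumerate(r))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

x = [-4, 21, 14, 10, 22, -11, 14, 18, 14, 21, 19, 20]
# kendall_T(x) ≈ 0.273 and spearman_S(x) ≈ 0.330, as in the example
```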
More Examples and Comments. 1. If you can see a trend on the graph
with your naked eye, don't even bother to do any testing. The data on Figure 26
represent simulated i.i.d. normal random variables (100 values) with zero mean and
σ = 14.
On Figures 27 and 28, you can see how little trend we need to add to make
it statistically significant: for Xt = −0.1t + noise and for Xt = 0.13t + noise,
both Kendall and Spearman tests give a positive answer at the 5% level of
significance.
2. The tests are not fool-proof. If the trend is not monotone, the tests may
detect it or fail to do so. For instance, both Kendall and Spearman tests are positive
for the data shown on Figure 29 and they are negative for the data shown on Figure
30 though both series contain a trend that goes up and down. Still, both tests turn
positive if we apply them to an appropriate portion of the second data set, say, to
the first or last third of the series.
3. The tests are sensitive to somewhat different things and may disagree with
each other. Say, according to the Kendall test, there is no trend in the data shown
on Figure 31 and there is a trend in the data shown on Figure 32. According to
the Spearman test, the situation is opposite. And, those data sets are parts of the
same time series — the water level in Dead Sea (Figure 14 in the Introduction).
Figure 32. Water level in Dead Sea, years 1934 - 1956. Kendall
test is positive and Spearman test is negative this time.
take the absolute values (Figure 37), we can try the tests again. Both of the tests
allow us to claim that the series is not stationary.
Exercises
1. Show that if the series is monotone increasing, then S = T = 1.
2. Show that if the series is monotone decreasing, then S = T = −1. Hint:
You may find the following formula useful:
1^2 + 2^2 + · · · + n^2 = n(n + 1)(2n + 1)/6
3. For a given data set (will be posted on the web), find the Kendall and Spearman
rank correlation coefficients. Based on them, test the data for stationarity at the 5%
level of significance.
Project. For given noise data εt, t = 1, . . . , 30 (will be posted on the web),
find out how much trend is necessary for it to be detected by the Kendall test. That
is, how big should a be (up to two decimal places) to make the series Xt = at + εt
test positive? Same about Xt = −at + εt. Same question about the Spearman test.
Which of the tests turns out to be more sensitive?
CHAPTER 2
Stationary Models
between s and t is greater than 1. Since Cov(Xt, Xt) = Var(Xt) = 2σ_ε^2 has been
already calculated, it remains to find Cov(Xt, Xt+1). We have
Cov(Xt, Xt+1) = Cov(εt+1 − εt, εt+2 − εt+1)
= Cov(εt+1, εt+2) − Cov(εt+1, εt+1) − Cov(εt, εt+2) + Cov(εt, εt+1)
= −σ_ε^2.
Hence, the process is second order stationary (in fact, it is completely stationary,
though we did not show that). For its autocovariance function, R(0) = 2σ_ε^2 and
R(1) = R(−1) = −σ_ε^2; all the other values of R are zero.
5. This is also a second order stationary process. Indeed, suppose the process
is defined by the formula (1.3). Then
EXt = E(A) cos(ωt) + E(B) sin(ωt) = 0
and
Var Xt = Var(A) cos^2(ωt) + Var(B) sin^2(ωt) + 2 Cov(A, B) cos(ωt) sin(ωt) = σ^2.
In a similar way, it could be shown (Problem 1) that
(1.5)    Cov(Xs, Xs+t) = σ^2 cos(ωt).
If the process is defined by the equation (1.4), then it is even strictly stationary.
We will only verify second order stationarity though. We could use trig identity
and transform this equation into the (1.3) form, or we could proceed as follows.
First, we notice that
E cos(ωt + φ) = (1/2π) ∫_0^{2π} cos(ωt + x) dx = 0
and therefore
EXt = E(A cos(ωt + φ)) = EA E cos(ωt + φ) = 0.
In a similar way,
E cos^2(ωt + φ) = (1/2π) ∫_0^{2π} cos^2(ωt + x) dx = 1/2
and
Var Xt = E Xt^2 = E(A^2) E cos^2(ωt + φ) = E(A^2)/2.
Along the same lines, one can verify that
(1.6)    Cov(Xs, Xs+t) = (1/2) E(A^2) cos(ωt)
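A quick Monte Carlo check of (1.6); the amplitude distribution (uniform on [0.5, 1.5]), the frequency and the lag are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000
omega, t = 0.7, 3                      # arbitrary frequency and lag

A = rng.uniform(0.5, 1.5, N)           # amplitude, independent of the phase
phi = rng.uniform(0, 2 * np.pi, N)     # phase, uniform on [0, 2*pi)

X0 = A * np.cos(phi)                   # X_s at s = 0, across realizations
Xt = A * np.cos(omega * t + phi)       # X_{s+t}

cov = np.mean(X0 * Xt)                 # EX = 0, so this estimates the covariance
theory = 0.5 * np.mean(A ** 2) * np.cos(omega * t)
# cov and theory agree to Monte Carlo accuracy
```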
Autocorrelation and Partial Autocorrelation. Let Xt be a stationary
process with autocovariance function R(t). Its autocorrelation function or ACF, is
defined by the formula
(1.7)    ρ(t) = Corr(Xs, Xs+t) = Cov(Xs, Xs+t)/√(Var(Xs) Var(Xs+t)) = R(t)/R(0)
So, the autocorrelation function is just the autocovariance function normalized to have
ρ(0) = 1. The autocorrelation function has the following properties.
1. It is even: ρ(t) = ρ(−t).
2. ρ(0) = 1.
for k ≥ 1. As you can see, the determinants differ by the last column only. For
k = 1, we get φ(1) = ρ(1). It could be shown that |φ(k)| ≤ 1 for all k.
PACF and the best linear predictor. The formula (1.10) looks like some-
thing out of the blue. However, as we’ll see later, it plays an important role in the
theory of autoregressive models.
The following property partly reveals the mystery. Let Xt be a stationary
process with zero expectation. Suppose that we want to approximate a random
variable Xt+k by a linear combination of Xt , Xt+1 , . . . , Xt+k−1 in the sense of least
squares. That is, we want to minimize the expectation
Q(b1, . . . , bk) = E(Xt+k − b1 Xt+k−1 − · · · − bk Xt)^2
To do so, we should expand the expression, note that E(Xt Xs ) = R(t − s), set
the partial derivatives with respect to bi to zero and solve the corresponding linear
equations (similar to the multiple linear regression). It could be shown (see a sketch
below) that the optimal coefficient bk (the last one) coincides with φ(k).
Yet another interpretation for the partial autocorrelation function is as follows.
Suppose we already found the coefficients b1 , . . . , bk described above. Denote by Z1
the prediction error Z1 = Xt+k −b1 Xt+k−1 −· · ·−bk Xt . Next, we want to find a best
prediction for Xt−1 given the same Xt , Xt+1 , . . . , Xt+k−1 . It could be shown that
the solution is given by the formula b1 Xt +· · ·+bk Xt+k−1 , with the same coefficients
b1 , . . . , bk , taken in the reversed order. Let now Z2 = Xt−1 − b1 Xt − · · · − bk Xt+k−1
be the corresponding prediction error. It turns out that Corr(Z1 , Z2 ) = φ(k + 1)
(and, for that reason, it does not exceed one in absolute value).
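This characterization gives a practical way to compute φ(k) without the determinants in (1.10): solve the linear system Σ_j b_j ρ(i − j) = ρ(i), i = 1, . . . , k, and take the last coefficient. A sketch, tested on the geometric autocorrelation ρ(k) = a^{|k|} (for which, as shown later for the AR(1) process, φ(k) = 0 when k > 1):

```python
import numpy as np

def pacf(rho, k):
    # phi(k) as the last coefficient of the best linear predictor:
    # solve P b = r with P[i, j] = rho(|i - j|), r[i] = rho(i + 1).
    P = np.array([[rho(abs(i - j)) for j in range(k)] for i in range(k)])
    r = np.array([rho(i + 1) for i in range(k)])
    return float(np.linalg.solve(P, r)[-1])

rho = lambda k: 0.7 ** abs(k)   # geometric ACF, a = 0.7
# pacf(rho, 1) = 0.7, and pacf(rho, k) vanishes for k > 1
```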
Now, since the expectation of Xt is equal to zero, E(Xt−i)^2 = σ^2 and E Xt−i Xt−j =
R(j − i). For this reason,
Q(b1, . . . , bk) = σ^2 (1 + Σ_{i=1}^{k} b_i^2) − 2 Σ_{i=1}^{k} b_i R(i) + 2 Σ_{i=1}^{k−1} Σ_{j=i+1}^{k} b_i b_j R(j − i)
Dividing by σ^2, we get
(1/σ^2) Q(b1, . . . , bk) = 1 + Σ_{i=1}^{k} b_i^2 − 2 Σ_{i=1}^{k} b_i ρ(i) + 2 Σ_{i=1}^{k−1} Σ_{j=i+1}^{k} b_i b_j ρ(j − i)
Differentiating with respect to bi and setting the partial derivatives to zero, we get
the equation
b_i − ρ(i) + Σ_{j ≠ i} b_j ρ(i − j) = 0
Exercises
1. Show that the autocovariance function of the process (1.3) in the example
5 is indeed given by the formula (1.5)
2*. By setting t1 = −1, t2 = 0, t3 = 1 in (1.8), show that
ρ(2) ≥ 2ρ(1)2 − 1.
Conclude from here that −1 ≤ φ(2) ≤ 1.
Hint: Let c2 = 1, c1 = a, c3 = b and find a, b that minimize the expression in
(1.8). Compute the minimal value; it must still be non-negative. For the second
part, use the definition of the partial autocorrelation function.
3. For the process (1.1), show that R(s, t) = σε2 min(s, t).
4. Find the formula for the ACF of the process (1.2).
5. Find first four values of the PACF (that is, φ(1), φ(2), φ(3), φ(4)) for the
process (1.2).
6. Let Xt and Yt be two second order stationary processes such that Xt and Ys
are independent for all s and t. Show that Zt = Xt + Yt is second order stationary
and its autocovariance function RZ (k) = RX (k) + RY (k) for all k.
7. Let Xt be a second order stationary process with autocovariance function
RX (k), and let Yt = (Xt + Xt−1 )/2. Verify that Yt is a second order stationary
process. Find its autocovariance function in terms of RX (k).
8. Let Xt be a stationary process with zero expectation and autocorrelation
function ρ(k). (a) Show that b = ρ(1) minimizes the expectation
E(Xt − bXt−1 )2
as well as the expectation
E(Xt−2 − bXt−1 )2 .
(b) Verify that
Corr(Xt − ρ(1)Xt−1 , Xt−2 − ρ(1)Xt−1 ) = φ(2).
9*. Let Xt be a stationary process with zero expectation and autocorrelation
function ρ(k). (a) Find the coefficients b1 , b2 that minimize the expectation
E(Xt − b1 Xt−1 − b2 Xt−2 )2
and verify that b2 = φ(2). (b) For the coefficients b1 , b2 from part (a), compute
Corr(Xt − b1 Xt−1 − b2 Xt−2 , Xt−3 − b1 Xt−2 − b2 Xt−1 )
and compare it with φ(3).
but any further analysis requires the so called invertibility condition (Section 7).
The expectation of Xt is equal to µ and its variance is equal to
(2.6)    σ^2 = (b_0^2 + · · · + b_l^2) σ_ε^2
The autocovariance function is given by the formula
(2.7)    R(i) = (b_0 b_i + b_1 b_{i+1} + · · · + b_{l−i} b_l) σ_ε^2  for i = 0, . . . , l,  and R(i) = 0 for i > l
(and, of course, R(−i) = R(i) defines the values of R for negative arguments).
From (2.6) and (2.7),
(2.8)    ρ(i) = (b_0 b_i + b_1 b_{i+1} + · · · + b_{l−i} b_l)/(b_0^2 + b_1^2 + · · · + b_l^2)  for i = 0, . . . , l,  and ρ(i) = 0 for i > l
There is no simple formula for the partial autocorrelation other than (1.10). Still,
as in case of MA(1) process, φ(k) does not vanish outside of any finite interval.
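Formula (2.8) in code — a small helper computing ρ(i) directly from the coefficients, checked on an MA(1) with b0 = 1, b1 = 0.5, for which ρ(1) = 0.5/1.25 = 0.4:

```python
def ma_acf(b, i):
    # rho(i) for the MA(l) process X_t = mu + b_0 e_t + ... + b_l e_{t-l},
    # by formula (2.8); zero beyond the lag l.
    l = len(b) - 1
    if i > l:
        return 0.0
    num = sum(b[j] * b[j + i] for j in range(l - i + 1))
    den = sum(bj ** 2 for bj in b)
    return num / den

ma_acf([1, 0.5], 1)  # 0.5 / 1.25 = 0.4
ma_acf([1, 0.5], 2)  # 0.0 -- the ACF cuts off after lag l
```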
Examples. On Figures 3 - 10 you can see some simulated MA(1) and MA(2)
processes together with their ACF and PACF.
General Linear Process (or a Moving Average of Infinite Order).
Suppose bi, i = 0, 1, 2, . . . , is an infinite sequence such that Σ_{i=0}^{∞} b_i^2 < ∞. For a
white noise εt, let
Sn,t = µ + Σ_{i=0}^{n} b_i εt−i
Let n < m. Note that Sm,t − Sn,t = Σ_{k=n+1}^{m} b_k εt−k and therefore
Var(Sn,t − Sm,t) = Σ_{k=n+1}^{m} b_k^2 σ_ε^2. So, Var(Sn,t − Sm,t) → 0 as n, m → ∞, which means that the
sequence of random variables Sn,t converges in a certain sense (so-called convergence
in L2, or in mean squares, see Appendix A, section 4). Hence, we can define a
stochastic process
Xt = lim_{n→∞} Sn,t = µ + Σ_{i=0}^{∞} b_i εt−i
As above, it implies
(2.11)    ρ(k) = Σ_{i=0}^{∞} b_i b_{i+k} / Σ_{i=0}^{∞} b_i^2,    k > 0.
Remark. Moving average processes, together with general linear processes,
have one important property. Since Xt is a linear combination of past values of
the noise εt, or a limit of those, and since εt is white noise, that is, a sequence of
i.i.d. random variables, Xt is independent of the future values of the noise εt+k, k > 0.
Moreover, we will see in Section 10 that εt is, in fact, a one step ahead prediction
error.
Exercises
1. Let
Xt = m + εt + aεt−1
and
Yt = m + ηt + (1/a) ηt−1,
where εt and ηt are white noises. Show that if σ_η^2 = a^2 σ_ε^2, then the processes Xt and Yt have
the same mean and variance and the same autocorrelation function.
2. For a process
Xt = εt + εt−1 + 0.6εt−2 ,
find its autocorrelation function, and the first three values φ(1), φ(2), φ(3) of the
partial autocorrelation function.
3. The same question for a process
Xt = εt + 2εt−1 + εt−2 .
4. The same question for a process
Xt = εt − 3εt−1 + 3εt−2 − εt−3 .
5. The same question for a process
Xt = εt − (3/2) εt−1 + εt−2 − (1/4) εt−3.
6. Find the autocorrelation function of the process described by the equation
Xt = (εt + · · · + εt−l)/(l + 1)
7. For a general linear process
Xt = εt + bεt−1 + b2 εt−2 + · · · + bk εt−k + . . . ,
where |b| < 1, find its variance and the autocorrelation function.
8*. Verify (2.4).
Hint: As a first step, show by induction that, for a k × k matrix
[ 1 + b^2   b         0         . . .   0         0       ]
[ b         1 + b^2   b         . . .   0         0       ]
[ 0         b         1 + b^2   . . .   0         0       ]
[ . . .     . . .     . . .     . . .   . . .     . . .   ]
[ 0         0         0         . . .   1 + b^2   b       ]
[ 0         0         0         . . .   b         1 + b^2 ]
its determinant is equal to 1 + b^2 + · · · + b^{2k}.
Therefore,
(3.9)    R(k) = a^{|k|} σ_X^2,    ρ(k) = a^{|k|}
Partial autocorrelation function of AR(1). Let us look at the determinant
in the numerator of the formula (1.10). To be specific, let us compare the first and
the last column (assuming, of course, that k > 1). According to (3.9), the first
column contains the numbers 1, a, a2 , . . . , ak−1 . On the other hand, the last column
contains the numbers a, a2 , a3 , . . . , ak . Therefore the last column is proportional
to the first one. Hence, the corresponding matrix is degenerate and therefore its
determinant is equal to zero. Therefore, the partial autocorrelation function φ(k) =
0 if k > 1 (and, of course, φ(0) = 1, φ(1) = ρ(1) = a).
Examples. Depending on the sign of a, we have two different types of behavior.
If a is negative, the series oscillates heavily. If a is positive, the series slowly goes
up and down, though not periodically. On Figures 11 - 16 you can see simulated
AR(1) processes with a = 0.9 and a = −0.9, together with their autocorrelation
and partial autocorrelation functions.
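A simulation makes the a = 0.9 picture concrete: the sample ACF of a simulated AR(1) decays roughly like a^k. A sketch (the sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_ar1(a, n, burn=200):
    # X_t = a X_{t-1} + eps_t with standard normal noise; a burn-in
    # period lets the process forget the X_0 = 0 start.
    e = rng.standard_normal(n + burn)
    x = np.zeros(n + burn)
    for t in range(1, n + burn):
        x[t] = a * x[t - 1] + e[t]
    return x[burn:]

def sample_acf(x, k):
    # Sample autocorrelation at lag k.
    x = x - x.mean()
    return float(np.dot(x[:-k], x[k:]) / np.dot(x, x))

x = simulate_ar1(0.9, 5000)
# sample_acf(x, k) is close to 0.9**k for small lags k
```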
Exercises
1. Does the equation
Xt + 0.75Xt−1 = εt
describe a stationary process? If so, what is its expectation? What is the variance?
What is the autocorrelation function? Which of the following data (graphs will be
posted on the web) may possibly be described by this equation?
2. The same questions for the equation
Xt − 0.75Xt−1 = εt
3. Does the equation
Xt + 0.5Xt−1 = 3 + εt
describe a stationary process? If so, what is its expectation? What is the variance?
4. Same questions for the equation
Xt + Xt−1 = 3 + εt
5. Same questions for the equation
Xt − 2Xt−1 = εt
For instance,
(1 − 0.2B)Xt = Xt − 0.2BXt = Xt − 0.2Xt−1
The other way around,
(εt + εt−1 )/2 = (εt + Bεt )/2 = [(1 + B)/2]εt = (0.5 + 0.5B)εt
This notation is especially useful if we wish to consider several operations applied
to the series one after one. For instance, suppose we first want to take the incre-
ments (Xt − Xt−1 ), and after that, the averages of the two consecutive increments.
Formally, we should, first, consider Yt = Xt − Xt−1 and then, Zt = (Yt + Yt−1 )/2.
However, this boils down to the formula Zt = 0.5Xt − 0.5Xt−2 . In terms of the
back shift operator, we have
Zt = (0.5 + 0.5B)Yt = (0.5 + 0.5B)(1 − B)Xt
and
(0.5 + 0.5B)(1 − B) = 0.5(1 + B)(1 − B) = 0.5(1 − B 2 )
Operator (1 − B)Xt = Xt − Xt−1 is called the difference operator and has
a special notation ∇Xt (reads ‘nabla’). It is useful when we need to reduce the
non-stationary process to a stationary one. For instance, a process (1.1) can be
characterized by the equations
X0 = 0, Xt = Xt−1 + εt
which can be rewritten as
∇Xt = εt .
An expression ∇^2 Xt = Xt − 2Xt−1 + Xt−2 is called a second difference, and so on.
For convenience, we set ∇0 = 1, so that ∇0 Xt = Xt .
In a similar way, we define a seasonal difference ∇d Xt = (1−B d )Xt = Xt −Xt−d
which is useful, for instance, when we deal with monthly, etc., data, and expect
certain annual cycles.
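In NumPy, `np.diff` implements the difference operator and its powers; a quick illustration on a quadratic trend, which the second difference reduces to a constant, together with a seasonal difference with an assumed period d = 3:

```python
import numpy as np

x = np.array([1, 4, 9, 16, 25, 36], dtype=float)  # X_t = t**2, a quadratic trend

nabla1 = np.diff(x)        # (1 - B) X_t   = X_t - X_{t-1}          -> [3, 5, 7, 9, 11]
nabla2 = np.diff(x, n=2)   # (1 - B)^2 X_t = X_t - 2 X_{t-1} + X_{t-2} -> [2, 2, 2, 2]

d = 3
seasonal = x[d:] - x[:-d]  # (1 - B^d) X_t = X_t - X_{t-d}          -> [15, 21, 27]
```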
In terms of B, the AR(1) model (3.1) can be rewritten as
(4.1) (1 − aB)Xt = εt
and the representation (3.5) becomes
(4.2)    Xt = (Σ_{i=0}^{∞} a^i B^i) εt,
or
Xt = (1 + aB + a^2 B^2 + · · ·) εt,
which agrees with the power series representation
(4.3)    (1 − az)^{−1} = 1/(1 − az) = 1 + az + a^2 z^2 + · · · ,    |az| < 1,
and with the following formal implication of (4.1)
(4.4)    Xt = (1 − aB)^{−1} εt
(which, as written, makes no sense because the right side is not formally defined).
Rule of thumb: in formal calculations related to power series expansions,
B should be treated as a number with absolute value equal to 1.
Of course, (4.4) and (4.3) do not prove (3.5) or (4.2). However, they lead to a
correct formula which can be eventually justified by other means.
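The formula can indeed be confirmed by other means, for instance numerically: a simulated AR(1) path agrees with a truncated version of the series (4.2). A minimal sketch in Python; the value a = 0.6, the seed, and the truncation depth are illustrative choices with |a| < 1, not from the text.

```python
import numpy as np

a = 0.6
rng = np.random.default_rng(1)
eps = rng.normal(size=2000)

# Simulate X_t = a X_{t-1} + eps_t starting from zero
x = np.empty_like(eps)
prev = 0.0
for t in range(len(eps)):
    prev = a * prev + eps[t]
    x[t] = prev

# Truncated representation X_t ~ sum_{i<K} a^i eps_{t-i} as in (4.2)
K = 60          # a^K is negligible for |a| < 1
t = len(eps) - 1
approx = sum(a**i * eps[t - i] for i in range(K))
err = abs(x[t] - approx)
```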
Exercises
1. For each of the following equations, rewrite them in B notation.
Xt − 0.4Xt−1 = εt
Xt = εt + εt−1 − 0.5εt−2
Yt = Xt − Xt−1 + εt
2. The following equations are written in B notation. Rewrite them in ordinary
notation (without the back shift operator).
(1 − B + 0.5B²)Xt = εt
(1 − 0.2B)(1 − 0.5B)Xt = εt
(1 − B¹²)Xt = (1 + 0.5B)εt
(1 − B)²Xt = εt
3. The values X1 , . . . , X10 are known to us and are equal to 153, 189, 221, 215,
302, 223, 201, 173, 121, 106. For which values of t,
Yt = (1 − B)²Xt
can be computed? What are they?
[Figure: the stationarity region in the (a1, a2) plane, divided into the subregions of real and complex roots.]
(5.16) σX² = σε² / (1 + a1 ρ(1) + a2 ρ(2)).
Next,
R(1) = Cov(Xt−1 , Xt ) = Cov(Xt−1 , εt − a1 Xt−1 − a2 Xt−2 )
= Cov(Xt−1 , εt ) − a1 Cov(Xt−1 , Xt−1 ) − a2 Cov(Xt−1 , Xt−2 )
= 0 − a1 σX² − a2 R(1).
Dividing by the variance σX², we get the following equation for the autocorrelation
function ρ:
(5.17) ρ(1) + a1 + a2 ρ(1) = 0
which implies
(5.18) ρ(1) = −a1 / (1 + a2).
Next, for every k ≥ 2,
R(k) = Cov(Xt−k , Xt ) = Cov(Xt−k , εt − a1 Xt−1 − a2 Xt−2 )
= Cov(Xt−k , εt ) − a1 Cov(Xt−k , Xt−1 ) − a2 Cov(Xt−k , Xt−2 )
= 0 − a1 R(k − 1) − a2 R(k − 2)
which yields
(5.19) ρ(k) + a1 ρ(k − 1) + a2 ρ(k − 2) = 0.
The equation (5.19) is known under the name Yule—Walker equation. In particular,
it implies
(5.20) ρ(2) = −a1 ρ(1) − a2 = a1² / (1 + a2) − a2.
Substituting (5.18) and (5.20) into (5.16), we get
(5.21) σX² = (1 + a2) σε² / [(1 − a2)(1 − a1 + a2)(1 + a1 + a2)].
Note what happens to the variance as the coefficients approach the boundaries of
the triangle of stationarity.
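Formulas (5.16), (5.18), (5.20) and (5.21) are easy to cross-check numerically. A minimal sketch in Python; the coefficients a1 = −0.5, a2 = 0.25 (the same ones used in Example 1 below) are one choice inside the triangle of stationarity.

```python
a1, a2 = -0.5, 0.25        # X_t - 0.5 X_{t-1} + 0.25 X_{t-2} = eps_t
sigma_eps2 = 1.0

rho1 = -a1 / (1 + a2)                      # (5.18)
rho2 = a1**2 / (1 + a2) - a2               # (5.20)

# Variance of X_t computed two ways: via (5.16) and via the closed form (5.21)
var_via_516 = sigma_eps2 / (1 + a1 * rho1 + a2 * rho2)
var_via_521 = (1 + a2) * sigma_eps2 / (
    (1 - a2) * (1 - a1 + a2) * (1 + a1 + a2))
diff = abs(var_via_516 - var_via_521)      # should be ~0
```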
Since ρ(0) = 1 and ρ(1) is known, (5.19) can be solved (see the details below).
Its solution can be expressed in terms of the roots z1, z2. If z1 ≠ z2, then the
solution is given by the formula
(5.22) ρ(k) = [(1 − z2²) z1^(k+1) − (1 − z1²) z2^(k+1)] / [(z1 − z2)(1 + z1 z2)], k ≥ 0.
If z1 = z2 = z∗ , then
(5.23) ρ(k) = z∗^k (1 + k (1 − z∗²)/(1 + z∗²)), k ≥ 0.
Note that, even if z1 , z2 are not real numbers, they are complex conjugates and
(5.22) still results in a real number. Also, formulas (5.22) and (5.23) do not work
for negative k, but the relation ρ(k) = ρ(−k) allows us to find the corresponding
values.
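Formula (5.22) can be checked directly against the recursion (5.19). A minimal sketch in Python; the coefficients a1 = −1.1, a2 = 0.3 are an arbitrary illustrative choice that gives distinct real roots inside the unit disk.

```python
import numpy as np

a1, a2 = -1.1, 0.3
# Roots of the auxiliary equation z^2 + a1 z + a2 = 0
disc = np.sqrt(a1**2 - 4 * a2)
z1, z2 = (-a1 + disc) / 2, (-a1 - disc) / 2     # 0.6 and 0.5

# rho(k) from the Yule-Walker recursion (5.19), started at rho(0), rho(1)
rho = [1.0, -a1 / (1 + a2)]
for k in range(2, 20):
    rho.append(-a1 * rho[k - 1] - a2 * rho[k - 2])

# Closed form (5.22)
def rho_closed(k):
    return ((1 - z2**2) * z1 ** (k + 1) - (1 - z1**2) * z2 ** (k + 1)) / (
        (z1 - z2) * (1 + z1 * z2))

max_diff = max(abs(rho[k] - rho_closed(k)) for k in range(20))
```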
Partial autocorrelation function of AR(2) process. As in the case of the
first order autoregression, let us write down the determinant that stands in the
numerator of (1.10). It is equal to
| 1        ρ(1)      . . .   ρ(1) |
| ρ(1)     1         . . .   ρ(2) |
| ρ(2)     ρ(1)      . . .   ρ(3) |
| . . .    . . .     . . .   . . . |
| ρ(k−1)   ρ(k−2)    . . .   ρ(k) |
Assume k > 2. According to (5.17) and (5.19), if we multiply the first column
by −a1 and the second column by −a2 and add them up, then the result will
be precisely equal to the last column. Hence, the matrix is degenerate and its
determinant is equal to zero. Therefore
φ(k) = 0 for all k > 2
Also, φ(1) = ρ(1) = −a1 /(1 + a2 ) and
φ(2) = (ρ(2) − ρ(1)²) / (1 − ρ(1)²) = −a2
Behavior of the process. If the roots z1 , z2 are real, then the behavior of
the process is, roughly, similar to the behavior of the AR(1) process that corresponds
to the biggest (in absolute value) root. Interesting effects may occur when the
roots are complex. Let θ be such that
(5.24) cos(θ) = −a1 / (2√a2)
(note that 4a2 > a1² for the roots to be complex, so a2 is positive and the expression
in (5.24) does not exceed one in absolute value). Then (5.10) can be rewritten in
the form
(5.25) z1,2 = √a2 (cos(θ) ± i sin(θ)) = √a2 e^(±iθ)
and, with some help of complex exponents and trig identities (see below), (5.22)
implies
(5.26) ρ(k) = (√a2)^k sin(kθ + ψ) / sin ψ
where ψ is such that
tan ψ = ((1 + a2)/(1 − a2)) tan θ
So, ρ(k) can be represented as a product of a periodic function sin(kθ + ψ) and an
exponent (√a2)^k. As a result, the process has a quasiperiodic behavior (a cycle),
with period 2π/θ.
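The period 2π/θ and the formula (5.26) can be illustrated numerically. A sketch in Python; the coefficients a1 = −1, a2 = 0.5 are an illustrative complex-root choice (here θ = π/4, so the cycle has length 8), and ψ is taken from the principal branch of arctan, which is valid for this choice.

```python
import numpy as np

a1, a2 = -1.0, 0.5          # complex roots, since a1^2 < 4 a2
theta = np.arccos(-a1 / (2 * np.sqrt(a2)))     # (5.24): theta = pi/4 here
period = 2 * np.pi / theta                     # quasiperiod, here 8

# psi from tan(psi) = (1 + a2)/(1 - a2) tan(theta); psi is in (0, pi/2) here
psi = np.arctan((1 + a2) / (1 - a2) * np.tan(theta))

# Compare (5.26) with the Yule-Walker recursion (5.19)
rho = [1.0, -a1 / (1 + a2)]
for k in range(2, 15):
    rho.append(-a1 * rho[-1] - a2 * rho[-2])
rho_526 = [np.sqrt(a2) ** k * np.sin(k * theta + psi) / np.sin(psi)
           for k in range(15)]
max_diff = max(abs(r - s) for r, s in zip(rho, rho_526))
```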
Examples. 1. Let us consider the equation
Xt − 0.5Xt−1 + 0.25Xt−2 = 3 + εt .
It describes a stationary process, for instance because its coefficients satisfy the
inequalities (5.12). According to (5.6), the expectation of Xt is equal to 4. By (5.18)
and (5.20), its first autocorrelation is ρ(1) = 0.4, its second autocorrelation is
ρ(2) = −0.05, and σX² = (80/63) σε². Next, the corresponding roots z1 , z2 of the
equation (5.7) are equal to
z1,2 = (1 ± i√3)/4 = (1/2) e^(±iπ/3)
(since their absolute values are equal to 0.5, we have another confirmation of sta-
tionarity). In order to write down a formula for a generic value ρ(k) of the au-
tocorrelation function, we could either use the formula (5.26) (yes, θ = π/3, but
tan ψ = 5√3/3 is not inspiring), or solve the equation (5.19) directly. According to
Appendix C, Section 3, every solution to (5.19), in particular ρ(k), could be written
as
ρ(k) = (1/2)^k (C1 cos(kπ/3) + C2 sin(kπ/3))
Plugging in ρ(0) = 1 and ρ(1) = −a1/(1 + a2), we get two linear equations
for C1 and C2 that can be easily solved. Namely, we get
C1 = (1/(z2 − z1)) (z2 + a1/(1 + a2)), C2 = 1 − C1 = (1/(z1 − z2)) (z1 + a1/(1 + a2))
in case of the distinct roots, and
C1 = 1, C2 = −1 − a1 / (z∗ (1 + a2))
as promised.
3. Stationarity condition and General Linear Process (sketch). Sup-
pose that the stationary process Xt satisfies the equation (5.3) and the polynomial
α(z) satisfies the stationarity condition (5.4). We want to show that the process
Xt can be represented as a general linear process.
It is more convenient to switch to the equation (5.7). Its roots z1 , z2 should
satisfy the condition (5.8). Suppose, first, that z1 , z2 are distinct and real. The
equation (5.7) implies that z1 z2 = a2 and z1 + z2 = −a1 . Therefore
α(x) = 1 + a1 x + a2 x2 = (1 − z1 x)(1 − z2 x)
and
1/α(x) = 1/(1 + a1 x + a2 x²) = 1/((1 − z1 x)(1 − z2 x)) = (1/(z1 − z2)) [ z1/(1 − z1 x) − z2/(1 − z2 x) ]
(partial fractions). This way, we get the following power series representation:
(5.28) 1/α(x) = (1/(z1 − z2)) Σ_{k=0}^{∞} (z1^(k+1) − z2^(k+1)) x^k
and the series converges if |z1 x| < 1 and |z2 x| < 1. By substituting the shift
operator B instead of x, we get a representation
Xt = (1/α(B)) εt = (1/(z1 − z2)) Σ_{k=0}^{∞} (z1^(k+1) − z2^(k+1)) B^k εt
(5.29)
= (1/(z1 − z2)) Σ_{k=0}^{∞} (z1^(k+1) − z2^(k+1)) εt−k
Because of (5.8), the series on the right converges in mean squares and defines a
general linear process that satisfies (5.1).
If the roots are complex, the above arguments still hold but the expression on
the right side of (5.29) contains some complex numbers and has to be adjusted.
Indeed, then z1 and z2 are complex conjugates and therefore admit a representation
z1,2 = √a2 e^(±iθ) = √a2 (cos θ ± i sin θ)
for some θ. Then z1 − z2 = 2i√a2 sin θ and z1^(k+1) − z2^(k+1) = 2i (√a2)^(k+1) sin((k + 1)θ),
and the formula (5.29) boils down to
(5.30) Xt = (1/α(B)) εt = Σ_{k=0}^{∞} (√a2)^k (sin((k + 1)θ)/sin θ) εt−k
Finally, suppose now that the roots are equal: z1 = z2 = z∗ . Then
α(x) = 1 + a1 x + a2 x² = (1 − z∗ x)², 1/α(x) = 1/(1 − z∗ x)²
and
(5.31) 1/α(x) = Σ_{k=0}^{∞} (k + 1) z∗^k x^k
This leads to the following power series representation:
(5.32) Xt = (1/α(B)) εt = Σ_{k=0}^{∞} (k + 1) z∗^k B^k εt = Σ_{k=0}^{∞} (k + 1) z∗^k εt−k
Again, this series converges in mean squares if |z∗ | < 1 and defines a general linear
process that satisfies (5.1).
The converse is also true: if a stationary process Xt satisfies (5.1) and, for every
k > 0, Xt is independent from εt+k , then the condition (5.8) must be valid.
Exercises
Hint to Problems 1-5: You can just follow the examples from p.71 (you may
also find Section C.3 useful). Equally, you could verify that the values ρ(0), ρ(1)
are the correct ones, and that the Yule—Walker equations (5.19) are satisfied for
all k ≥ 2. Also, you could use (5.22), (5.23) or (5.26).
1. Verify that the equation
Xt + (2/15) Xt−1 − (1/15) Xt−2 = εt
describes a stationary series. Show that its ACF is given by the formula
ρ(k) = (9/14) (−1)^k (1/3)^k + (5/14) (1/5)^k, k ≥ 0.
Find its PACF.
2. Verify that the equation
Xt − (7/6) Xt−1 + (1/3) Xt−2 = εt
describes a stationary series. Show that its ACF is given by the formula
ρ(k) = −(5/4) (1/2)^k + (9/4) (2/3)^k, k ≥ 0.
Find its PACF.
3. Verify that the equation
Xt − (2/3) Xt−1 + (4/9) Xt−2 = εt
describes a stationary series. Show that its ACF is given by the formula
ρ(k) = (2/3)^k ( cos(πk/3) + (5/(13√3)) sin(πk/3) ), k ≥ 0.
Find its PACF.
4. Verify that the equation
Xt − Xt−1 + 0.5Xt−2 = εt
describes a stationary series. Show that its ACF is given by the formula
ρ(k) = (1/√2)^k ( cos(πk/4) + (1/3) sin(πk/4) ), k ≥ 0.
Find its PACF.
5. Verify that the equation
Xt − Xt−1 + 0.25Xt−2 = εt
describes a stationary series. Show that its ACF is given by the formula
ρ(k) = (1 + (3/5) k) (1/2)^k, k ≥ 0.
(6.5) σX² = σε² / (1 + a1 ρ(1) + · · · + ak ρ(k))
and
(6.6) ρ(t) + a1 ρ(t − 1) + · · · + ak ρ(t − k) = 0
for all t ≥ 1. The equation (6.6) is called the Yule—Walker equation. Since
ρ(0) = 1 and ρ(t) = ρ(−t), the Yule—Walker equations with t = 1, . . . , k − 1 allow
us to find all the first values ρ(1), . . . , ρ(k − 1). For t ≥ k, the equation (6.6) turns
out to be a difference equation (see Appendix C, section 3) and therefore ρ(t) must
be a linear combination of exponents and damped sine functions.
The partial autocorrelation function of an AR(k) process has the property
(6.7) φ(l) = 0 if l ≥ k + 1.
Example. Consider the equation
Xt − Xt−1 + 0.5Xt−2 − 0.125Xt−3 = εt .
The corresponding polynomial α(z) = 1 − z + 0.5z² − 0.125z³ has roots 2 and 1 ± i√3.
Since the absolute values of all the roots are equal to 2, the process is stationary. It
has an expectation zero (since no constant is involved). In order to find the first two
values of the autocorrelation function, we have to use the Yule—Walker equations.
Setting t = 1, we get the equation
ρ(1) − ρ(0) + 0.5ρ(−1) − 0.125ρ(−2) = 0.
Since ρ(0) = 1 and ρ(−1) = ρ(1), ρ(−2) = ρ(2), it is equivalent to
1.5ρ(1) − 0.125ρ(2) = 1
Next, setting t = 2, we get
ρ(2) − ρ(1) + 0.5ρ(0) − 0.125ρ(−1) = 0
which is equivalent to
−1.125ρ(1) + ρ(2) = −0.5
Solving for ρ(1) and ρ(2), we get
ρ(1) = 20/29, ρ(2) = 8/29.
We can now find
ρ(3) = ρ(2) − 0.5 ρ(1) + 0.125 ρ(0) = 13/232
and so on. Using those values and the formula (1.10), we can now find the first three
values of the partial autocorrelation function (this is an AR(3) model, so φ(k) = 0
for k ≥ 4). Namely,
φ(1) = ρ(1) = 20/29, φ(2) = (ρ(2) − ρ(1)²) / (1 − ρ(1)²) = −8/21
and
φ(3) = [ρ(3) + ρ(1)ρ(2)² + ρ(1)³ − 2ρ(1)ρ(2) − ρ(1)²ρ(3)] / [1 + 2ρ(1)²ρ(2) − ρ(2)² − 2ρ(1)²] = 1/8
In order to find a closed form expression for ρ(k), we have to solve the Yule-Walker
equation
(6.8) ρ(k) − ρ(k − 1) + 0.5ρ(k − 2) − 0.125ρ(k − 3) = 0, k≥3
To this end, we have to find the roots of the auxiliary equation
z 3 − z 2 + 0.5z − 0.125 = 0
They are equal to z1 = 1/2 and z2,3 = 1/4 ± i√3/4; all three have the absolute value
1/2. According to Appendix C.3, the general solution to the equation (6.8) is given by
the formula
ρ(k) = C1 (1/2)^k + (1/2)^k (C2 cos(kπ/3) + C3 sin(kπ/3))
(the first term is related to z1 and the rest is related to z2,3 ). Plugging in k = 0, 1, 2,
we get three equations
1 = C1 + C2 (k = 0)
20/29 = (1/2)(C1 + (1/2) C2 + (√3/2) C3) (k = 1)
8/29 = (1/4)(C1 − (1/2) C2 + (√3/2) C3) (k = 2)
Solving the equations, we get
C1 = 21/29, C2 = 8/29, C3 = 10√3/29
and therefore
ρ(k) = (1/2)^k ( 21/29 + (8/29) cos(kπ/3) + (10√3/29) sin(kπ/3) )
for k ≥ 3.
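The closed form above can be double-checked against the Yule—Walker recursion. A minimal sketch in Python, using the values ρ(1) = 20/29 and ρ(2) = 8/29 computed in the text:

```python
import numpy as np

rho = [1.0, 20 / 29, 8 / 29]
# Yule-Walker recursion for this AR(3):
# rho(k) = rho(k-1) - 0.5 rho(k-2) + 0.125 rho(k-3)
for k in range(3, 20):
    rho.append(rho[k - 1] - 0.5 * rho[k - 2] + 0.125 * rho[k - 3])

# Closed-form solution found in the text
def rho_formula(k):
    return 0.5 ** k * (21 / 29
                       + (8 / 29) * np.cos(k * np.pi / 3)
                       + (10 * np.sqrt(3) / 29) * np.sin(k * np.pi / 3))

rho3 = rho[3]                                   # equals 13/232
max_diff = max(abs(rho[k] - rho_formula(k)) for k in range(20))
```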
Details. 1. Stationarity Condition and General Linear Processes. Let
the roots of α(z) satisfy the condition |z| > 1. It could be shown that 1/α(z) can be
represented as a sum of its Maclaurin series
1/α(z) = 1 + C1 z + C2 z² + . . .
with the radius of convergence that is greater than 1. It follows either from the
theory of analytic functions, or from the same trick that we have used with AR(2):
we factorize
α(x) = (1 − z1 x) . . . (1 − zk x)
where z1 , . . . , zk are the roots of the equation (6.3). After that, we apply the partial
fractions and the formula for the sum of a geometric series.
In any case, it means that the series Σ Ci converges absolutely, and therefore
(6.9) Xt = (1/α(B)) εt = (1 + C1 B + C2 B² + . . . )εt
is a well-defined general linear process. It could be verified that Xt satisfies (6.1).
For this reason, Xt is independent from the future values of the noise εt+k , k > 0.
Another way to explain the role of the stationarity condition is to consider the
equation (6.1) without noise:
(6.10) Yt + a1 Yt−1 + · · · + ak Yt−k = 0
The equation (6.10) is a difference equation of the order k, and its general solution
can be written (Appendix C, section 3) in terms of the roots z1 , . . . , zk of (6.3). If
their absolute values satisfy |z1 | < 1, . . . , |zk | < 1, then Yt → 0 as t → ∞ no matter
what the initial values Y0 , . . . , Yk−1 are. It could be shown that, with added noise, the
process Yt converges to a stationary process determined by the right side of the
equation (6.9) no matter what Y0 , . . . , Yk−1 are, and therefore Yt is asymptotically
stationary.
2. Derivation of (6.5) and (6.6). We begin with the formula
Xt = εt − a1 Xt−1 − · · · − ak Xt−k .
Since εt is independent from past values of the process,
Cov(Xt , εt ) = σε2
and
σX² = Cov(Xt , Xt ) = Cov(Xt , εt ) − Σ_{j=1}^{k} aj Cov(Xt , Xt−j ) = σε² − Σ_{j=1}^{k} aj R(j)
Next, R(j) = σX² ρ(j) and therefore
σX² (1 + a1 ρ(1) + · · · + ak ρ(k)) = σε²
which implies (6.5). Next, let i > 0. Then
R(i) = Cov(Xt−i , Xt ) = Cov(Xt−i , εt ) − Σ_{j=1}^{k} aj Cov(Xt−i , Xt−j ) = −Σ_{j=1}^{k} aj R(i − j).
Dividing by σX², we get (6.6).
3. Derivation of (6.7) (sketch). Let l > k. According to (1.10), φ(l)
is a quotient of two determinants. Like in AR(2) case, Yule—Walker equation
(6.6) implies that the determinant in the numerator is equal to zero, because if
we multiply the first k columns by the weights a1 , . . . , ak , and subtract the last
column, we get zero.
Exercises
1. Does the equation
Xt + aXt−1 − Xt−2 − aXt−3 = εt
describe a stationary series for any a?
2. (a) Verify that the equation
(1 − 0.5B)(1 − B + 0.5B²)Xt = −1 + εt
describes a stationary process. What is its expectation and variance? (b) Write the
Yule—Walker equations (6.6) for t = 1, 2, 3 and solve them to find ρ(1), ρ(2), ρ(3).
3*. For the same equation, find a formula for ρ(k).
4*. Does the equation
(1 + B + 0.5B²)(1 + 0.5B + 0.25B²)Xt = εt
describe a stationary process? If yes, what is its expectation and variance? Write
the Yule—Walker equations (6.6) for t = 1, 2, 3, 4 and solve them to find ρ(1), ρ(2),
ρ(3) and ρ(4).
and therefore
1/β(z) = 1/((1 − 0.5z)(1 + 0.4z)) = (5/9) · 1/(1 − 0.5z) + (4/9) · 1/(1 + 0.4z)
and
1/(1 + 0.4z) = 1 − 0.4z + (0.4)² z² − · · · = Σ_{k=0}^{∞} (−1)^k (0.4)^k z^k
Multiplying those series by 5/9 and 4/9 and adding them up, we get the following
representation
1/β(z) = Σ_{k=0}^{∞} [(5 (0.5)^k + 4 (−1)^k (0.4)^k)/9] z^k
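The Maclaurin coefficients of 1/β(z) can also be obtained by long division and compared with the closed form; here β(z) = (1 − 0.5z)(1 + 0.4z) = 1 − 0.1z − 0.2z², as above. A minimal sketch in Python:

```python
beta = [1.0, -0.1, -0.2]     # coefficients of z^0, z^1, z^2 in beta(z)

# Long division: d[0] = 1 and sum_j beta[j] d[k-j] = 0 for every k >= 1
K = 25
d = [1.0]
for k in range(1, K):
    d.append(-sum(beta[j] * d[k - j] for j in range(1, min(k, 2) + 1)))

# Closed form (5 (0.5)^k + 4 (-0.4)^k) / 9 from the text
closed = [(5 * 0.5**k + 4 * (-0.4) ** k) / 9 for k in range(K)]
max_diff = max(abs(u - v) for u, v in zip(d, closed))
```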
ρ(1) = b / (1 + b²)
and therefore |ρ(1)| ≤ 0.5 (moreover, ρ(1) = ±0.5 corresponds to b = ±1, and
the corresponding process does not satisfy the invertibility condition). The same
is true for higher order models: for MA(2), the biggest possible value for ρ(1)
is √2/2 ≈ 0.707 (Problem 3); for MA(3), the biggest possible value for ρ(1) is
(1 + √5)/4 ≈ 0.809 (Problem 4); for MA(4), the biggest possible value is √3/2 ≈
0.866. It could be shown that the biggest possible value of |ρ(1)| for a MA(l) model
is equal to cos(π/(l + 2)).
Details. Convergence of the series (7.5) (sketch)*. Since the radius of
convergence of the series (7.3) is greater than one, the series converges absolutely
at z = 1 and therefore
Σ_k |Dk | < ∞.
We are going to show that the sequence of random variables Sn,t converges in mean
squares as n → ∞. Indeed,
Var(Sn,t − Sm,t ) = Cov( Σ_{i=n+1}^{m} Di Xt−i , Σ_{j=n+1}^{m} Dj Xt−j )
= Σ_{i=n+1}^{m} Σ_{j=n+1}^{m} Di Dj Cov(Xt−i , Xt−j )
= Σ_{i=n+1}^{m} Σ_{j=n+1}^{m} Di Dj R(i − j)
≤ ( Σ_{i=n+1}^{m} |Di | )( Σ_{j=n+1}^{m} |Dj | ) σX²
= ( Σ_{i=n+1}^{m} |Di | )² σX²
because |R(k)| ≤ σX² for all k. Since the series Σ |Dk | converges, the random variables
Sn,t form a Cauchy sequence in mean squares and therefore converge in mean
squares (see Appendix A, section 3).
Exercises
1. Which of the following MA models satisfy the invertibility condition?
Xt = εt − εt−1 + 2εt−2
Xt = εt + εt−1 − 0.5εt−2
Xt = εt − 0.6εt−1 + 0.8εt−2
Xt = εt + εt−1 + εt−2
2. For the following MA(2) model
Xt = εt + (2/15) εt−1 − (1/15) εt−2 ,
find the representation (7.5), that is find the formula for the coefficients.
3. Do the same for model
Xt = εt − εt−1 + 0.5εt−2 .
4. Do the same for model
Xt = εt + εt−1 + 0.25εt−2 .
Hint: You may find the following formula useful:
1/(1 − q)² = 1 + 2q + 3q² + · · · + k q^(k−1) + . . .
5*. Show that |ρ(1)| ≤ √2/2 for MA(2) models, and that the models with ρ(1) = ±√2/2 are not invertible.
6**. Show that |ρ(1)| ≤ (1 + √5)/4 for MA(3) models, and that the models with ρ(1) = ±(1 + √5)/4 are not invertible.
So, for large m, ρ(m) is a linear combination of exponents and damped sine functions,
as it was for AR(k) models. In order to find the initial values of the autocovariance/autocorrelation
function, we have to evaluate the right side of (8.6), that is,
to compute Cov(εt , Xs ) for t − l ≤ s ≤ t. One of the possible ways to do that is to
use (8.3).
ARMA(1,1). Let us consider a special case ARMA(1,1) (k = l = 1) in more
detail. A traditional way to write down the corresponding equation is as follows:
Xt − aXt−1 = εt + bεt−1
or
(1 − aB)Xt = (1 + bB)εt
The conditions of stationarity and invertibility imply |a| < 1, |b| < 1. We have
α(z) = 1 − az
and therefore
1/α(z) = 1 + az + a²z² + . . .
Hence, (8.3) becomes
Xt = (1 + bB)(1 + aB + a²B² + . . . )εt = (1 + Σ_{k=1}^{∞} a^(k−1) (a + b) B^k )εt
With help of (2.9), (2.10) and (2.11), we get
σX² = σε² (1 + Σ_{k=1}^{∞} a^(2k−2) (a + b)²) = σε² (1 + (a + b)²/(1 − a²)),
R(l) = σε² (a^(l−1) (a + b) + Σ_{k=1}^{∞} a^(2k+l−2) (a + b)²) = σε² a^(l−1) (a + b)(1 + a(a + b)/(1 − a²)),
if l ≥ 1, and
(8.7) ρ(l) = a^(l−1) (a + b)(1 + ab) / (1 + 2ab + b²), l ≥ 1.
In particular, ρ(l) decays at exponential rate for l ≥ 1. The same is true for the
partial autocorrelation function (though the computation is difficult).
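Formula (8.7) can be verified numerically against the MA(∞) weights just derived. A minimal sketch in Python; a = 0.7 and b = 0.3 are illustrative choices satisfying |a| < 1, |b| < 1, and the truncation depth is an assumption that makes the tail negligible.

```python
import numpy as np

a, b = 0.7, 0.3   # ARMA(1,1): X_t - a X_{t-1} = eps_t + b eps_{t-1}

# MA(infinity) weights: psi_0 = 1, psi_j = a^(j-1) (a + b) for j >= 1
K = 200
psi = np.concatenate(([1.0], (a + b) * a ** np.arange(K - 1)))

# Autocovariances from R(l) = sigma_eps^2 sum_j psi_j psi_{j+l}
R0 = np.sum(psi * psi)
rho_numeric = [np.sum(psi[:-l] * psi[l:]) / R0 for l in range(1, 6)]

# Closed form (8.7)
rho_87 = [a ** (l - 1) * (a + b) * (1 + a * b) / (1 + 2 * a * b + b**2)
          for l in range(1, 6)]
max_diff = max(abs(u - v) for u, v in zip(rho_numeric, rho_87))
```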
Dangers of the ARMA models. Suppose Xt = εt is just a white noise.
Then, for any a,
(8.8) Xt − aXt−1 = εt − aεt−1
which formally looks like an ARMA(1,1) process. Generalizing, if the polynomials
α(x) and β(x) have a common root, the orders k and l can be reduced by 1 (we
factorize α and β in the equation (8.2) and cancel the common factors).
If we don’t notice that and try to fit ARMA(1,1) model to the data that are ac-
tually described by (8.8), the estimation may produce strange results (for instance,
all parameters are statistically insignificant) or may even fail.
If the roots do not coincide but are close to each other, then the order of the
model can’t be formally reduced. But, still, the estimation may go wrong. Also, in
order to find a statistically significant difference between a model with close roots
and a reduced model, we need very long data sets.
Advantages of the ARMA models. As we pointed out already, the ARMA
model can be rewritten as an autoregression or moving average model of infinite
order (see (8.4) and (8.3)) and therefore can be approximated by an autoregression
or moving average model of sufficiently large order. However, an ARMA model
typically has fewer parameters to estimate.
In fact, ARMA models are very flexible. As we will see later, they may approx-
imate every stationary process with continuous spectral density.
Examples. On Figures 24 - 26 you can see the graph of a simulated ARMA(1,1)
process Xt = 0.9Xt−1 + εt + 0.9εt−1 , together with its ACF and PACF. Figures 27
- 29 show the graph of the process Xt = 0.9Xt−1 + εt − 0.85εt−1 with close roots,
and its ACF and PACF. As we can see, the process is very close to the white noise.
Exercises
or
(9.4) Xt = Xt−1 + εt
(we assume µ = 0). Iterating (9.4), we get
Xt = X0 + ε1 + ε2 + · · · + εt
(sums of i.i.d. random variables). The process is not stationary, however it does
not have a trend (that is, it can’t be decomposed into the sum of a noise and a
non-random function).
2. In contrast, suppose that
Xt = a + bt + εt
(linear trend plus white noise). Then
Yt = ∇Xt = b + εt − εt−1
which can be rewritten as
Yt − b = (1 − B)εt
which looks like (9.1) with one exception: β(z) = 1 − z does not satisfy the invert-
ibility condition (its only root is equal to 1).
This can be generalized as follows. Suppose that
Xt = a + bt + ηt
where ηt is an ARMA(k, l) process described by the equation
α(B)ηt = β(B)εt .
Then
Yt = ∇Xt = b + ∇ηt ,
and ζt = ∇ηt follows the equation
α(B)ζt = (1 − B)β(B)εt .
and therefore does not satisfy the invertibility condition because (1 − z)β(z) definitely
has a root at 1. The same applies to any polynomial trends, say, parabolic
(of course, in order to eliminate a parabolic trend, you have to switch to second
increments, that is, apply the difference operator twice). So, an ARMA process
plus deterministic trend can’t be described by an ARIMA model. On Figures 32 and
33 you can see a simulated process of this type.
3. Consider now an equation
(9.5) Xt − 1.25Xt−1 + 0.25Xt−2 = εt
It can be rewritten as
A(B)Xt = εt
where
A(z) = 1 − 1.25z + 0.25z² = (1 − z)(1 − 0.25z)
The equation A(z) = 0 has roots z = 1 and z = 4 and therefore it does not satisfy
the stationarity condition. However, A(z) can be viewed as the product
A(z) = (1 − z)α(z)
where
α(z) = 1 − 0.25z
satisfies the stationarity condition. Hence, the equation (9.5) can be rewritten as
follows
(1 − 0.25B)∇Xt = εt
or
(1 − 0.25B)Yt = εt
where Yt = ∇Xt is the first increment of Xt . The last equation describes an AR(1)
process with zero expectation, so the original equation (9.5) actually describes an
ARIMA(1,1,0) model. On Figure 30 you can see a simulated process that follows
the equation (9.5). Figure 31 shows an example of a simulated ARIMA(1,2,1)
process.
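The reduction of (9.5) to an AR(1) for the increments can be seen directly in a simulation. A minimal sketch in Python (the seed and sample size are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
eps = rng.normal(size=5000)

# Simulate X_t - 1.25 X_{t-1} + 0.25 X_{t-2} = eps_t, i.e. equation (9.5)
x = np.zeros(len(eps))
for t in range(2, len(eps)):
    x[t] = 1.25 * x[t - 1] - 0.25 * x[t - 2] + eps[t]

# The first differences should follow the stationary AR(1):
# (1 - 0.25 B) Y_t = eps_t, where Y_t is the increment of X_t
y = np.diff(x)
resid = y[2:] - 0.25 * y[1:-1] - eps[3:]
max_resid = np.max(np.abs(resid))
```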
4. A slight modification of the previous example,
(9.6) Xt − 1.25Xt−1 + 0.25Xt−2 = εt + 2εt−1 ,
produces a model that does not satisfy the invertibility condition. It can still be
written as (1 − 0.25B)∇Xt = (1 + 2B)εt , which looks like an ARIMA(1,1,1) model, but
β(z) = 1 + 2z has the root z = −0.5.
5. Consider now an equation
Xt − 2Xt−1 + Xt−2 = εt
Exercises
1. Does the following equation describe an ARMA process? If yes, what are
k, l? Does it describe an ARIMA process? If yes, what are the structure parameters
(that is, k, d, l)?
Xt − 0.5Xt−1 − 0.5Xt−2 = εt + 0.7εt−1
2. Same questions about the equation
Xt − 0.5Xt−1 − 0.5Xt−2 = 3 + εt + 0.7εt−1
3. Same for the equation
Xt = 2Xt−1 − Xt−2 + εt + εt−1 + 0.5εt−2
because, once again, the future values of the noise are independent from the past
values of the process. For the same reason,
X̂N +m (N ) = 0 whenever m ≥ 2
or
X̂N+1 (N) = b Σ_{k=0}^{∞} (−1)^k b^k XN−k
Once again, assume that we actually know infinitely many past values of the process.
Similar to MA(1) case, we have
εN = XN − aXN −1 − bεN −1
= XN − aXN −1 − b(XN −1 − aXN −2 − bεN −2 )
= XN − (a + b)XN−1 + abXN−2 + b² εN−2
...
= XN + Σ_{k=1}^{∞} (−1)^k (a + b) b^(k−1) XN−k
and therefore
XN+1 = εN+1 + aXN + b εN
= εN+1 + aXN + bXN + b Σ_{k=1}^{∞} (−1)^k (a + b) b^(k−1) XN−k
(10.4)
= εN+1 + Σ_{k=0}^{∞} (−1)^k (a + b) b^k XN−k
Taking a conditional expectation given XN , XN −1 , . . . , we get a formula
X̂N+1 (N) = Σ_{k=0}^{∞} (−1)^k (a + b) b^k XN−k
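The one step ahead prediction formula can be tested on a simulated path: with a truncated version of the series, the prediction error should coincide (up to a negligible truncation term) with the next noise value. A minimal sketch in Python; a = 0.6, b = 0.4, the seed, and the truncation depth K are illustrative assumptions.

```python
import numpy as np

a, b = 0.6, 0.4
rng = np.random.default_rng(7)
N = 3000
eps = rng.normal(size=N + 1)

# Simulate X_t = a X_{t-1} + eps_t + b eps_{t-1}
x = np.zeros(N + 1)
for t in range(1, N + 1):
    x[t] = a * x[t - 1] + eps[t] + b * eps[t - 1]

# Predict X_N from its past by the truncated series above
K = 80                        # |b|^K is negligible here
x_hat = sum((-1) ** k * (a + b) * b**k * x[N - 1 - k] for k in range(K))

# The one step prediction error is (almost exactly) eps_N
err = abs((x[N] - x_hat) - eps[N])
```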
Exercises
1. A process Xt satisfies the equation
Xt − 0.8Xt−1 = εt
Give a formula for an m steps ahead prediction, m ≥ 1.
2. Same question about the process
Xt = εt − 0.8εt−1
3. A process Xt is an ARIMA(0,2,0) process, that is, Xt satisfies the equation
∇²Xt = (1 − B)²Xt = εt
Give a formula for one step ahead prediction.
CHAPTER 3
(1.2) Var X̄ = (1/N²) Σ_{k=1}^{N} Σ_{l=1}^{N} Cov(Xk , Xl ) = (1/N²) Σ_{k=1}^{N} Σ_{l=1}^{N} R(k − l) = (σX²/N²) Σ_{k=1}^{N} Σ_{l=1}^{N} ρ(k − l)
= (σX²/N²) (N + 2(N − 1)ρ(1) + 2(N − 2)ρ(2) + · · · + 2ρ(N − 1))
and therefore
Var X̄ ≤ (σX²/N) (1 + 2|ρ(1)| + 2|ρ(2)| + · · · + 2|ρ(N − 1)|)
≤ (σX²/N) (1 + 2 Σ_{k=1}^{∞} |ρ(k)|) → 0
as N → ∞ if the sum Σ_{k=1}^{∞} |ρ(k)| is finite.
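For a concrete series, (1.2) shows how much positive correlation inflates the variance of the sample mean compared with the i.i.d. value σX²/N. A minimal sketch in Python for an AR(1)-type autocorrelation ρ(k) = a^k; the values a = 0.5 and N = 100 are illustrative choices.

```python
import numpy as np

a = 0.5                      # autocorrelations rho(k) = a^k
N = 100
sigma_X2 = 1.0
k = np.arange(1, N)
rho = a ** k

# (1.2): Var(Xbar) = (sigma_X^2/N^2) (N + 2 sum_{k=1}^{N-1} (N - k) rho(k))
var_mean = sigma_X2 / N**2 * (N + 2 * np.sum((N - k) * rho))

# Compare with the i.i.d. value sigma_X^2 / N; here the ratio is close to 3
ratio = var_mean / (sigma_X2 / N)
```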
Estimation of the variance. It is reasonable to begin with
Estimation of the variance. It is reasonable to begin with
σ̂² = (1/N) Σ_{i=1}^{N} (Xi − X̄)²
We may expect this estimate to be biased. Indeed, it is biased even if we work
with a sample, that is, with a collection of i.i.d. random variables. In case of a
sample, we are able to improve the situation by considering another estimate s2
given by (B.1.2). In case of a time series, consecutive values of the series may have
non-trivial correlations, and that trick is no longer working.
Indeed, similar to (B.1.1), we can show that
(1.3) σ̂² = (1/N) Σ_{i=1}^{N} (Xi − µ)² − (X̄ − µ)²
which implies
E σ̂² = σX² − Var X̄
In case of a sample, the bias Var X̄ = σX²/N is proportional to σX². In case of a
time series, the bias Var X̄ is given by (1.2), and an unbiased estimate can’t be
constructed unless we know the autocorrelation function in advance, which never
happens. For these reasons, whenever we estimate second moments, like the variance,
the autocovariance and the autocorrelation function, we can’t get rid of bias. We should
be happy if we get asymptotically unbiased and consistent estimates.
So, can we claim that σ̂X² is asymptotically unbiased? Let us assume that the
series (1.1) converges. This is harmless enough, in particular because otherwise
we can’t even estimate the mean. Under this assumption, Var X̄ → 0 as N →
∞ by (1.2). But Var X̄ is precisely the bias of the estimate σ̂X². Hence σ̂X² is
asymptotically unbiased under the above assumption. Is it consistent? We will
discuss that in the next section.
Details. Verification of (1.3). Easy! We write
Σ_{i=1}^{N} (Xi − µ)² = Σ_{i=1}^{N} ((Xi − X̄) + (X̄ − µ))²
= Σ_{i=1}^{N} (Xi − X̄)² + 2(X̄ − µ) Σ_{i=1}^{N} (Xi − X̄) + N (X̄ − µ)²
and note that Σ_{i=1}^{N} (Xi − X̄) = 0.
Exercises
1. Let Xt be a MA(1) process Xt = m + εt + aεt−1 where εt is the white noise.
(a) Compute the mean and variance of Xt in terms of the coefficients of the model and
the variance of the noise σε². (b) Compute the variance of the sample mean X̄ in
terms of a and σε². Find the quotient
Var(X̄) / (σX²/N)
and graph it as a function of a, −1 < a < 1. Which one is bigger, Var(X̄) or σX²/N?
2. Let Xt be an AR(1) process Xt = c + aXt−1 + εt where εt is the white noise.
(a) Compute the mean and variance of Xt in terms of the coefficients of the model and
the variance of the noise σε². (b) Compute the variance of the sample mean X̄ in
terms of a and σε². Find the quotient
Var(X̄) / (σX²/N)
and graph it as a function of a, −1 < a < 1. Which one is bigger, Var(X̄) or σX²/N?
need to know (or be able to estimate) the bias, the variance and the covariance of
the estimates.
We have already seen that the estimate for the variance, and therefore the
estimate for R(0), is asymptotically unbiased. Similar computations (see below)
allow us to show that
E R̂(k) ≈ R(k) − (k/N) R(k) − ((N − k)/N) Var X̄
(2.4) = R(k) − (k/N) R(k) − ((N − k)/N²) σX² (1 + 2((N − 1)/N) ρ(1)
+ 2((N − 2)/N) ρ(2) + · · · + (2/N) ρ(N − 1))
The last two terms represent the bias. It does not exceed
(k/N) R(k) + ((N − k)/N²) σX² (1 + 2 Σ_{k=1}^{∞} |ρ(k)|)
If the series Σ_{k=1}^{∞} |ρ(k)| converges (and, without that, even the estimation of the
mean is a problem), then the bias goes to zero as N → ∞ and our estimate is
asymptotically unbiased for every particular k.
In order to show that R̂(k) is a consistent estimate for R(k), we need to be able
to estimate the variance of R̂(k). However, in order to be able to study properties of
sample ACF and sample PACF, we need also things like Cov(R̂(k), R̂(k + l)). This
part is much harder and requires an additional assumption. Namely, we assume
that the process Xt is Gaussian. This assumption helps us to reduce expressions of
the type EXk Xl Xn Xm to the covariance function (see details below). As a result,
we can show that
(2.5) Cov(R̂(k), R̂(k + l)) ≈ (1/N) Σ_{t=−∞}^{∞} R(t) R(t + l) + (1/N) Σ_{t=−∞}^{∞} R(t − k) R(t + k + l)
Setting l = 0, we get
(2.6) Var(R̂(k)) ≈ (1/N) Σ_{t=−∞}^{∞} R(t)² + (1/N) Σ_{t=−∞}^{∞} R(t − k) R(t + k)
In particular,
(2.7) Var(R̂(0)) = Var(σ̂X²) ≈ (2/N) Σ_{t=−∞}^{∞} R(t)²
Formulae (2.7) and (2.6) imply, in particular, that σ̂X², as well as the sample ACV
R̂(k), are consistent estimates for the variance and for the autocovariance function.
Details. 1. Verification of (2.4) (sketch). To simplify computations, let
us assume that
(1/(N − k)) Σ_{t=1}^{N−k} Xt ≈ (1/(N − k)) Σ_{t=k+1}^{N} Xt ≈ X̄
(the average of any N − k consecutive terms of the series is approximately equal to the
average of all N terms). This is definitely true if k is much smaller than N. Then
(1/(N − k)) Σ_{t=1}^{N−k} (Xt − X̄)(Xt+k − X̄) ≈ (1/(N − k)) Σ_{t=1}^{N−k} (Xt − µ)(Xt+k − µ) − (X̄ − µ)²
where µ = EXt is the expectation of the series. However, the expectation of the
right side is exactly equal to
R(k) − Var(X̄).
On the other hand, the expression on the left is equal to (N/(N − k)) R̂(k) (compare with
(2.1)). Substituting Var(X̄) from (1.2), we get (2.4).
2. Verification of (2.5) (sketch). In order to demonstrate how it could
be done, we consider a very special case. Namely, suppose that the expectation
µ = EXt is known to us in advance, and it is equal to zero. In this, rather unlikely,
situation, we can replace X̄ by zero in the formula for the sample autocovariance
function. So, the formula (2.1) becomes
(2.8) R̂(k) = R̂(−k) = (1/N) Σ_{i=1}^{N−k} Xi Xi+k , k = 0, 1, . . . , N − 1
Assuming that the series is Gaussian, we can use the formula (A.2.7). We get
E(Xs Xs+k Xt Xt+k+l ) = R(t − s)R(t − s + l)
+ R(t − s + k + l)R(t − s − k) + R(k)R(k + l)
and (2.10) becomes
Cov(R̂(k), R̂(k + l)) = (1/N²) Σ_{s=1}^{N−k} Σ_{t=1}^{N−k−l} R(t − s) R(t − s + l)
+ (1/N²) Σ_{s=1}^{N−k} Σ_{t=1}^{N−k−l} R(t − s + k + l) R(t − s − k)
(2.11)
+ (1/N²) Σ_{s=1}^{N−k} Σ_{t=1}^{N−k−l} R(k) R(k + l)
− ((N − k)/N)((N − k − l)/N) R(k) R(k + l)
The last two terms cancel each other, and the first two can be simplified if we collect
terms with the same value of t − s.
Indeed, let i = t − s. We have −N + k + 1 ≤ i ≤ N − k − l − 1. Assume,
for instance, that l ≥ 0 and denote by bi the number of terms that correspond to
Exercises
1. For a data set (will be posted on the web), estimate mean, variance, first
ten values of ACF (that is, ρ(1), . . . , ρ(10)) and first three values of PACF (that is,
φ(1), φ(2), φ(3)).
and
(3.3) Cov(ρ̂(k), ρ̂(l)) ≈ (1/N) Σ_{t=−∞}^{∞} [ρ(t)ρ(t + l − k) + ρ(t + l)ρ(t − k)
+ 2ρ(k)ρ(l)ρ²(t) − 2ρ(k)ρ(t)ρ(t − l) − 2ρ(l)ρ(t)ρ(t − k)]
In particular, the values ρ̂(k) may be highly correlated for different k. Also, it could
be shown that, for large N , the distribution of ρ̂(k) is approximately normal.
Justification of (3.1), (3.2) and (3.3) is really hard. Some arguments that are
designed to illustrate the ideas and to outline the assumptions, could be found at
the end of the section.
Examples. 1. White noise. If Xt is a white noise, then ρ(k) = 0 for all
k ≠ 0 and therefore
E ρ̂(k) ≈ 0, Var ρ̂(k) ≈ 1/N, Cov(ρ̂(k), ρ̂(l)) ≈ 0
whenever k, l > 0, k ≠ l.
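The white noise case is easy to illustrate by simulation: almost all sample autocorrelations fall within ±2/√N. A minimal sketch in Python; the seed, the series length, and the number of lags are illustrative choices, and the denominator convention follows (2.1).

```python
import numpy as np

rng = np.random.default_rng(42)
N = 400
x = rng.normal(size=N)          # white noise
xbar = x.mean()

# Sample ACF with denominator N, as in (2.1)
denom = np.sum((x - xbar) ** 2)
rho_hat = np.array([np.sum((x[:-k] - xbar) * (x[k:] - xbar)) / denom
                    for k in range(1, 31)])

# Var(rho_hat(k)) ~ 1/N: most values should lie inside +-2/sqrt(N)
inside = np.mean(np.abs(rho_hat) < 2 / np.sqrt(N))
```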
Figure 13. Sample ACF can be computed for any series, not
necessarily a stationary one. Here you can see a sample ACF for
the ibm data (see Figure I.33).
[Figure: sample ACF, lags 1-30]
Figure 14. Sample ACF of the service data (see Figure 2 in the
Introduction) as an example of sample ACF for data with trend
and seasonality.
for k > 1,

   E ρ̂(k) ≈ 0,   Var ρ̂(k) ≈ (1/N)(1 + 2ρ²(1))

and

   Cov(ρ̂(k), ρ̂(m)) ≈ 0

if k > 1, m > k + 2.
3. MA(l) process. Similarly to the previous example,

   E ρ̂(k) ≈ 0,   Var ρ̂(k) ≈ (1/N)(1 + 2ρ²(1) + · · · + 2ρ²(l))

whenever k > l, and

   Cov(ρ̂(k), ρ̂(m)) ≈ 0

if k > l, m > k + 2l.
4. AR(1) process. Suppose Xt satisfies (II.3.1). In this particular case,
ρ(k) = a^{|k|} and the sum of the series (3.2) can be evaluated. Assuming that k is
large and the parameter a is not too close to 1 or −1 (so that a^k can be ignored),
we have

   Var ρ̂(k) ≈ (1/N) · (1 + a²)/(1 − a²)
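A quick Monte Carlo check of this approximation (the parameter values, seed and variable names below are our own choices):

```python
import numpy as np

rng = np.random.default_rng(1)
a, N, k, reps, burn = 0.5, 500, 10, 2000, 100

# simulate `reps` independent AR(1) paths X_t = a X_{t-1} + eps_t at once
eps = rng.standard_normal((reps, N + burn))
x = np.zeros_like(eps)
for t in range(1, N + burn):
    x[:, t] = a * x[:, t - 1] + eps[:, t]
x = x[:, burn:]                       # drop burn-in towards stationarity

# sample ACF at lag k for every path
xc = x - x.mean(axis=1, keepdims=True)
rho_k = np.sum(xc[:, :-k] * xc[:, k:], axis=1) / np.sum(xc * xc, axis=1)

emp = rho_k.var()                     # empirical variance across paths
theory = (1 + a**2) / ((1 - a**2) * N)
print(emp, theory)                    # the two should be close
```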
Sample PACF. Properties of the sample PACF are much harder to establish.
Certainly, φ̂(l) is a consistent (and therefore asymptotically unbiased) estimate
of the partial autocorrelation function φ(l), simply because the sample ACF is a
consistent estimate of the autocorrelation function. However, its distribution is
very hard to find.
As one can guess, properties of φ̂(l) are especially important to us when Xt
is an AR(k) process. Indeed, φ(l) = 0 for all l > k for such processes, and this
property is a defining characteristic of the AR(k) model. In this particular case, it
can be shown that, if l > k, then

(3.4)   E φ̂(l) ≈ 0.

If, in addition, the process is Gaussian, then

(3.5)   Var φ̂(l) ≈ 1/N

Moreover, it can be shown that the φ̂(l), l > k, are approximately independent, and
their distribution is approximately normal.
Choice of the order of the autoregression model. The above results may
help us to choose the order of the autoregression model by examining the sample
PACF. Indeed, if the order of the autoregression is k, then the random variables φ̂(l),
l > k, are (approximately) i.i.d. normal random variables with zero expectation and
variance 1/N. For each of them, the probability P{|φ̂(l)| > 2/√N} is approximately
.05. Choose some number M, for instance M = 25, and consider all l, k < l ≤ k + M,
such that |φ̂(l)| > 2/√N. The number of such l's has a binomial distribution with
parameters .05 and M, and its distribution can be calculated. For instance,

(3.6)   P{no more than three of the values |φ̂(l)|, l = k + 1, . . . , k + 25, exceed 2/√N} ≈ .96

So, the decision rule could be as follows: look for the smallest k such that |φ̂(k + 1)| <
2/√N and the condition in (3.6) (or similar) is satisfied. This could be a reasonable
first guess.
For example, suppose the first values of the sample PACF are φ̂(1) = 0.7,
φ̂(2) = −0.4, φ̂(3) = −0.5, φ̂(4) = 0.2, and all other values do not exceed 0.1.
Suppose also that the sample size is N = 400. Then 2/√N = 0.1. To satisfy
the condition in (3.6), we could take k = 1. However, the next three values are all
greater than 0.1 in absolute value, so we have to advance to k = 4, for which the
next value φ̂(5) is less than 0.1. If, however, φ̂(2) = −0.02, we would be in
a difficult position. Formally, the decision rule suggests k = 1. However, φ̂(3) and
φ̂(4) deviate from zero, respectively, by ten standard deviations and four standard
deviations. It is therefore quite reasonable to go for k = 4 nonetheless.
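The decision rule above can be sketched in a few lines of code. The function name and the value 0.05 used to pad the "small" tail are our own; the example reproduces the first case in the text, where the answer is k = 4:

```python
import numpy as np

def suggest_ar_order(pacf, N, M=25, allowed=3):
    """Decision rule sketch: pacf[i] holds phi_hat(i+1).  Find the smallest k
    with |phi_hat(k+1)| < 2/sqrt(N) such that no more than `allowed` of the
    next up-to-M values exceed the band, as in (3.6)."""
    band = 2 / np.sqrt(N)
    for k in range(len(pacf)):
        if abs(pacf[k]) >= band:          # pacf[k] is phi_hat(k+1)
            continue
        tail = pacf[k + 1:k + 1 + M]      # phi_hat(k+2), ... (up to M values)
        if np.sum(np.abs(tail) > band) <= allowed:
            return k
    return None

# worked example from the text: phi_hat(1..4) = 0.7, -0.4, -0.5, 0.2,
# all later values below 0.1 in absolute value, N = 400
pacf = np.array([0.7, -0.4, -0.5, 0.2] + [0.05] * 26)
print(suggest_ar_order(pacf, N=400))      # -> 4
```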
Choice of the order of a moving average model. In a similar way, we
can study MA(l) models. If the order of the moving average model is l, then
random variables ρ̂(k), k > l have (approximately) normal distribution with zero
[Figures: Concentration data with its sample ACF and PACF]
look different (Figure 23; white noise?). If we disregard the change of behavior
and estimate the ACF over the whole series (Figure 24), we may arrive at wrong
conclusions (white noise does not work for the whole series).
Details. Justification of (3.1) and (3.2) (sketch). Denote Mk = E R̂(k)
and δk = R̂(k) − Mk , so that Eδk = 0. By (2.6), the variance of R̂(k) goes to zero
[Figures: increments of the IBM data with their sample ACF and PACF]
Figure 21. Sample ACF of the increments of the IBM data estimated over the first 235 points. MA(1)?
Figure 22. Sample PACF of the increments of the IBM data estimated over the first 235 points. AR(2)?
Figure 23. Sample ACF of the increments of the IBM data estimated over the rest of the series. White noise?
Figure 24. Sample ACF of the increments of the IBM data estimated over the whole series. Looks like white noise (though it is not).
Exercises
1. Estimated values φ̂(1), . . . , φ̂(30) of sample PACF are given (will be posted
on the web). Following the procedure described in Section 3.3, make a suggestion
about the order of an autoregression model AR(k) assuming that (a) N = 100 (b)
N = 400 (c) N = 900 (d) N = 1600. Explain your decision.
2. Estimated values ρ̂(1), . . . , ρ̂(30) of sample ACF are given (will be posted
on the web). Following the procedure described in Section 3.3, make a suggestion
about the order of a moving average model MA(l) assuming that (a) N = 100 (b)
N = 400 (c) N = 900 (d) N = 1600. Explain your decision.
In order to find the minimum, we have to find the partial derivatives with respect
to the parameters and set them to zero. Unfortunately, the model is not linear
with respect to the parameters µ, a1 , . . . , ak which leads to complicated equations.
It could be made linear if we rewrite the equation (4.1) as
(4.3) Xt = a0 + a1 Xt−1 + · · · + ak Xt−k + εt
where
a0 = µ(1 − a1 − a2 − · · · − ak ).
Then the expression (4.2) becomes

(4.4)   Σ_t (Xt − a0 − a1 Xt−1 − · · · − ak Xt−k)²
Setting partial derivatives to zero, we get linear equations for a0 , . . . , ak . Solving
for them, we get the estimates â0 , . . . , âk . Finally, we set

   μ̂ = â0 / (1 − â1 − · · · − âk)

This way we get simple expressions for a0 , . . . , ak and a really ugly looking estimate
for the expectation μ.
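The linearized scheme (4.3)-(4.4) is just an ordinary linear regression of Xt on a constant and k lagged values. A minimal sketch (the helper name and the simulated test case are our own):

```python
import numpy as np

def fit_ar_ls(x, k):
    """Minimize (4.4): regress X_t on 1, X_{t-1}, ..., X_{t-k}.
    Returns (a1..ak, mu_hat) with mu_hat = a0_hat / (1 - a1_hat - ... - ak_hat)."""
    x = np.asarray(x, dtype=float)
    X = np.column_stack([np.ones(len(x) - k)] +
                        [x[k - j:len(x) - j] for j in range(1, k + 1)])
    coef, *_ = np.linalg.lstsq(X, x[k:], rcond=None)
    a0, a = coef[0], coef[1:]
    return a, a0 / (1 - a.sum())

# simulated AR(2) with mean 5: X_t = 5 + 0.5(X_{t-1}-5) + 0.2(X_{t-2}-5) + eps_t
rng = np.random.default_rng(2)
N, mu = 5000, 5.0
x = np.full(N, mu)
eps = rng.standard_normal(N)
for t in range(2, N):
    x[t] = mu + 0.5 * (x[t - 1] - mu) + 0.2 * (x[t - 2] - mu) + eps[t]

a_hat, mu_hat = fit_ar_ls(x, 2)
print(a_hat, mu_hat)    # roughly [0.5, 0.2] and 5
```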
2. Approximate least squares. This is a modification of the previous method.
We first estimate the mean, that is, replace μ by X̄, and apply least squares after
that. In other words, we minimize the sum

   Σ_t ((Xt − X̄) − a1(Xt−1 − X̄) − · · · − ak(Xt−k − X̄))².
Once again, we are getting linear equations here.
3. Yule-Walker estimates. As we have seen, the values of the autocorrela-
tion function must satisfy the Yule-Walker equations. Taking into account that
ρ(0) = 1 and ρ(i) = ρ(−i), we get k equations relating a1 , . . . , ak and
ρ(1), . . . , ρ(k):

(4.5)   ρ(t) − a1 ρ(t − 1) − · · · − ak ρ(t − k) = 0,   t = 1, 2, . . . , k

(Compare with (II.6.6). Since the equation (II.6.1) has been replaced by (4.1), we
have negative signs in (4.5).) Replacing the values of the autocorrelation function
by their estimates ρ̂(1), . . . , ρ̂(k) and solving for a1 , . . . , ak , we get the
Yule-Walker estimates of the coefficients a1 , . . . , ak . This method does not
provide an estimate of the mean; we have to estimate it separately (for
instance, by the sample mean X̄).
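For AR(2), the Yule-Walker system (4.5) is just two linear equations. A sketch (function name ours), checked against the exact autocorrelations of a known AR(2) process:

```python
import numpy as np

def yule_walker_ar2(rho1, rho2):
    """Solve (4.5) for t = 1, 2 with rho(0) = 1, rho(-1) = rho(1):
        [ 1    rho1 ] [a1]   [rho1]
        [ rho1  1   ] [a2] = [rho2]
    In practice rho1, rho2 are the sample values rho_hat(1), rho_hat(2)."""
    R = np.array([[1.0, rho1], [rho1, 1.0]])
    return np.linalg.solve(R, np.array([rho1, rho2]))

# sanity check: for a1 = 0.5, a2 = 0.2 the Yule-Walker equations give
# rho(1) = a1/(1 - a2) and rho(2) = a1*rho(1) + a2
rho1 = 0.5 / (1 - 0.2)       # 0.625
rho2 = 0.5 * rho1 + 0.2      # 0.5125
print(yule_walker_ar2(rho1, rho2))   # -> [0.5, 0.2]
```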
4. Conditional maximum likelihood method. We compute the conditional density
of Xk+1 , . . . , XN given X1 , . . . , Xk , plug in the observations, and maximize the
resulting expression with respect to the unknown parameters. For brevity, we assume
that k = 1 (first-order autoregression).
In order to speak about likelihood, we have to make assumptions about the
distribution of the process. So, we assume that the process is Gaussian, that is, the
noise εt has normal distribution with zero expectation and variance σε². It is
convenient to use the form (4.3) of the model, which becomes

   Xt = a0 + a1 Xt−1 + εt

The random variables

   εt = Xt − a0 − a1 Xt−1 ,   t = 2, . . . , N
+ · · · + (xN − axN−1)²]

(we have dropped a0 and replaced a1 by a). On the other hand, the density of X1
is also known to us. Namely, X1 has normal distribution with zero expectation and
variance σX² = σε²/(1 − a²), and its density is therefore equal to

(4.11)   fX1(x1) = (√(1 − a²)/(√(2π) σε)) exp(−((1 − a²)/(2σε²)) x1²)
We have to multiply the conditional density (4.10) by the density (4.11) and plug
in the data. We get the following expression:

   (√(1 − a²)/((2π)^{N/2} σε^N)) exp(−Q(a)/(2σε²))

where

   Q(a) = (1 − a²)X1² + (X2 − aX1)² + · · · + (XN − aXN−1)²

We need to maximize the resulting expression with respect to the unknown param-
eters. Taking the logarithm, we get the following log likelihood function:

   L(a, σε²) = (1/2) log(1 − a²) − (N/2) log(2π) − N log σε − Q(a)/(2σε²)
Note that estimation of a can no longer be separated from estimation of the variance
σε2 . So, in order to find the maximum, we have to use some numerical algorithms
(we can’t just set the partial derivatives to zero).
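The numerical work is simple in the AR(1) case: for fixed a, setting the σε-derivative to zero gives σ̂ε² = Q(a)/N, so only a one-dimensional search over a remains. A grid-search sketch (all names, the grid, and the simulated data are our own choices):

```python
import numpy as np

def Q(a, x):
    # Q(a) from the text: (1 - a^2) X_1^2 + sum_t (X_t - a X_{t-1})^2
    return (1 - a**2) * x[0]**2 + np.sum((x[1:] - a * x[:-1])**2)

def fit_ar1_mle(x):
    """Maximize the profile log likelihood in a (sigma profiled out,
    additive constants dropped) by brute-force grid search."""
    N = len(x)
    grid = np.linspace(-0.99, 0.99, 1981)
    ll = [0.5 * np.log(1 - a**2) - 0.5 * N * np.log(Q(a, x) / N) for a in grid]
    return grid[int(np.argmax(ll))]

rng = np.random.default_rng(3)
N, a_true = 2000, 0.6
eps = rng.standard_normal(N)
x = np.zeros(N)
x[0] = eps[0] / np.sqrt(1 - a_true**2)   # stationary start
for t in range(1, N):
    x[t] = a_true * x[t - 1] + eps[t]

print(fit_ar1_mle(x))   # close to 0.6
```

In practice one would use a proper numerical optimizer instead of a grid, but the grid makes the "no closed-form solution" point explicit.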
Further comments. Asymptotically, all these methods are equivalent. Nowa-
days the standard method is exact maximum likelihood. The other methods provide
‘rough’ estimates of the parameters which can be used as initial values for the
iterations. They can also be used for diagnostics.
Exercises
1. Given values of the sample ACF (will be posted on the web), find the
Yule—Walker estimates for the coefficients of AR(2) model.
(5.1) Xt = εt − bεt−1
where b is the unknown parameter, |b| < 1, and εt is the Gaussian white noise with
unknown variance σε2 .
All methods of estimation of the parameters of the model are somehow related
to the maximum likelihood method.
Assume that the process is Gaussian. Then X1 , . . . , XN have a multivariate
Gaussian distribution with zero expectation and known covariance matrix of a sim-
ple structure. So, the joint density of X1 , . . . , XN can be written and the maximum
likelihood method could be used, at least in theory. However, this straightforward
application of the maximum likelihood did not prove to be fruitful (computational
problems, especially in case of a general model, are very hard; forecasting is diffi-
cult as well). We’ll discuss another two modifications of the maximum likelihood
method.
with

(5.5)   Q(a, b, X0 , ε0 ) = (X1 + aX0 + bε0 )² + (X2 + (a + b)X1 + abX0 + b²ε0 )²
            + (X3 + (a + b)X2 + (a + b)bX1 + ab²X0 + b³ε0 )² + · · ·
            + (XN + (a + b)XN−1 + · · · + (a + b)b^{N−2}X1 + ab^{N−1}X0 + b^N ε0 )²
As above, we can either set X0 and ε0 to zero (conditional likelihood), or write
down the joint density of X0 , X1 , . . . , XN , ε0 and treat X0 and ε0 as unknown
parameters (preferable). In the latter case, Q is quadratic in X0 and ε0 , so we can
find them as functions of the observations X1 , . . . , XN and the parameters a, b.
which is, essentially, the same as (6.3). In the case of MA and ARMA models, the
definition of the function Q(a) is more complicated. Let us, for instance, consider an
ARMA(1,1) model with zero expectation (as in Section 5), and let Q(a, b, ε0 , X0 )
be the function defined by (5.5). It is quadratic in ε0 and X0 . Hence, we can easily
find the minimum

   Q(a, b) = min_{X0 ,ε0} Q(a, b, ε0 , X0 )
which is now a function of a and b only. In a similar way, we can compute the sum
of squares Q(a) for any ARMA model. The corresponding critical region can then
be constructed as in (6.4), where q should be the total number of unknown
parameters in the model (the order of the autoregression plus the order of the moving
average, plus one for the mean if the mean is not known), and n should be not the
number of observations N but the number of residuals in the model (in most
cases, N minus the order of the autoregression).
7. Comparison of Models
Which model should be used for a given data set? Should we use an AR(k), an
MA(l), or a mixed model? How do we choose k and l, that is, how do we choose the
structure of the model? There are two ways to handle the problem. According to
the first, we use a certain numerical criterion as an overall measure of quality of a
fitted model and choose the model with the best score. According to the second
approach, we develop a concept of a “good” model. Our goal is then to find a “good”
model (there could be several of them, or there could be none) and pick one of the
“good” ones.
Both approaches have advantages and disadvantages. With the first approach,
we don’t have to think: everything is left to the software. However, there is no reliable
overall criterion; all existing criteria have certain flaws. The second approach gives
more flexibility but requires a certain level of expertise (you have to understand
what you are doing). We begin with some formal criteria.
Residual Variance. Let q = k + l + 1 be the total number of parameters in the
model. As a first guess, we may consider the quotient Q(â)/(N − q) which, in the case
of multiple linear regression, works as an unbiased estimator of the variance of the
noise σε². Unfortunately, it is well known in statistics that such an overall criterion
over-parameterizes the model: the penalty for the use of extra parameters, hidden
in the denominator, is way too small.
Finite Prediction Error. For AR(k) models, we could use another function,

   FPE(k) = ((N + k)/(N − k)) (R̂(0) − â1 R̂(1) − · · · − âk R̂(k))

which is an (approximately) unbiased estimate of the variance of the one-step-ahead
prediction error.
Akaike Information Criterion, or AIC. Formally, AIC is defined as

   AIC = −2L(θ̂) + 2q

where L(θ̂) is the maximal value of the log likelihood function. Since the log
likelihood function for ARMA models is equivalent to Q(â), we see that AIC is
roughly equivalent to the Residual Variance criterion plus an extra penalty 2q.
It is also roughly equivalent to the logarithm of the FPE criterion whenever both
of them are applicable.
Unfortunately, the AIC criterion also tends to over-parameterize the model.
Bayesian Information Criterion, or BIC. To cure the over-parametrization
problem related to the AIC and FPE criteria, another modification was suggested.
It is defined by the formula

   BIC(q) = AIC(q) + q(log N − 1) + q log((σ̂X²/σ̂ε² − 1)/q)

This criterion is somewhat better than the AIC (the extra penalty term q(log N − 1)
is important, and over-parametrization is not likely), but still ...
“Good model” approach. Roughly speaking, the residuals of a “good” model
should be a white noise. At the same time, there should be no signs of over-
parametrization, so that no parameter can be dropped from the model.
A number of procedures can be used to check whether the residuals are a
white noise. To begin with, we should check whether the first values (like ρ(1), ρ(2) and
possibly ρ(3)) of the sample autocorrelation function of the residuals are significant.
The following test, called the Portmanteau lack-of-fit test, can also be used. Let
ρ̂(k) be the sample ACF of the residuals. If the ARMA(k, l) model is adequate,
then

   N(ρ̂(1)² + ρ̂(2)² + · · · + ρ̂(m)²)

has, approximately, a χ²(m − k − l) distribution. The number m of values used
in the test is at our disposal; it should not be comparable with the data length N.
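The statistic itself is easy to compute. The sketch below (helper name ours) applies it to a simulated white noise and compares it with 36.42, the 95% point of the χ² distribution with 24 degrees of freedom, as would be appropriate if one parameter had been fitted and m = 25:

```python
import numpy as np

def portmanteau(residuals, m):
    """Portmanteau statistic N * (rho_hat(1)^2 + ... + rho_hat(m)^2)
    for a residual series."""
    e = np.asarray(residuals, dtype=float)
    e = e - e.mean()
    N = len(e)
    rho = np.array([np.dot(e[:N - j], e[j:]) for j in range(1, m + 1)]) / np.dot(e, e)
    return N * np.sum(rho**2)

rng = np.random.default_rng(4)
stat = portmanteau(rng.standard_normal(500), m=25)
print(stat, stat < 36.42)   # 36.42 is the 95% point of chi^2(24)
```

For a genuine white noise the statistic is around m on average, so values far above the critical point signal a lack of fit.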
[Figures: Coal production data and its sample ACF]
Figure 26. ACF for Coal production data. Moving average mod-
els are not likely.
[Figure: sample PACF, lags 1-30]
after lag 10, some of them as large as −0.5, makes it questionable. So, we look
at the PACF instead (Figure 30). It clearly suggests an AR(1) model. Trying an
AR(1) model, we get estimates m̂ = 4.699 and â1 = 0.876 with p-values that are
practically zero (no surprise). As for the residuals, the Portmanteau test value
is 22.34, with p-value 0.5587, which looks really good. However, when we look at
the ACF and PACF values for the residuals, we see a number of values that are pretty
big (ρ̂(4) = −0.2152, φ̂(4) = −0.2596, φ̂(8) = −0.2673, with 95% confidence bounds
±0.2191).
For that reason, we probably should try to add more parameters to the model.
However, an AR(2) model does not solve the problem. First of all, the estimates
are m̂ = 4.716, â1 = 1.026 and â2 = −0.173, and the p-value for a2 is 0.1311, so the
senior coefficient of the model is not statistically significant. And we still have
big values of the ACF and PACF for the residuals; the value of the Portmanteau test
does not drop significantly. So, we try a mixed model, ARMA(1,1), instead. We get
m̂ = 4.714, â1 = 0.8281 and b̂1 = −0.2024; the biggest p-value is 0.007, so all the
coefficients of the model are statistically significant. The situation with the ACF and
PACF for the residuals improves somewhat: ρ̂(4) = −0.205 and φ̂(4) = −0.208 are now
within the confidence bounds, and the value of the Portmanteau test is 19.06 with
p-value 0.6976. Only φ̂(8) = −0.2296 is slightly out of range.
So, we have two reasonably good models here, AR(1) and ARMA(1,1). In order
to decide which one to choose, we could drop the last few values, fit the model to
[Figures: Profit Margin data and its sample ACF]
Figure 29. ACF for Profit Margin data. Moving average models
are not likely.
[Figure: sample PACF, lags 1-30]
the remaining part of the series, and compare the prediction with the actual data. If
we do so, we might like ARMA(1,1) better (Figures 31, 32).
3. We switch now to the Parts Availability data, 100 points, weekly, years 1971-
1972 (Figure 33). This time, both the Kendall and Spearman tests turn positive even
at the 99% level of confidence, so the series is probably not stationary. However, if
we look at the graph of the series, we may consider it a stationary series with a
long-periodic cycle. Indeed, the Kendall and Spearman tests are designed for data of
the trend-plus-white-noise type, and they can easily be confused by long-periodic
cycles. So, we should probably try to find an ARMA model for the original data
[Figures: predictions compared with the actual data; the Parts Availability data]
as well as try to find an ARMA model for the increments, which would give an
ARIMA model for the original series.
We begin with the series itself. Its ACF (Figure 34) does not suggest any
reasonable MA model. However, its PACF (Figure 35) suggests an AR(3) model.
Indeed, this way we get a ‘good’ model: m̂ = 8.229 with p-value practically zero, and
â1 = 0.1505, â2 = 0.2416, â3 = 0.3518 with p-values 0.1739, 0.0348 and 0.0024, re-
spectively. So, all the coefficients of the model except a1 are statistically significant.
The Portmanteau test value is 14.16 with p-value 0.8956. Naturally, we should try other
[Figures: sample ACF and PACF, lags 1-30]
Figure 35. PACF for Parts Availability data. AR(3) might work.
models. Surprisingly, an AR(2) model is also good: this time, all the parameters
are statistically significant, and the Portmanteau test value is 19.01 with p-value
0.7008. For an AR(4) model, the senior coefficient a4 is statistically insignificant.
If we try an ARMA(3,1) model instead, all three autoregression coefficients are
statistically insignificant, a sure sign of wrong parametrization.
We now switch to the increments (Figure 36). Their ACF (Figure 37) suggests
an MA(1) model. If we look at the PACF instead (Figure 38), it clearly suggests an
AR(2) model. Trying an MA(1) model, we get b̂1 = 0.7249 with p-value that is
practically zero, and the value of the Portmanteau test is 12.23, with p-value 0.9722,
unbelievably good. So, it is an ARIMA(0,1,1) model for the original data set. For an
MA(2) model, the senior coefficient is statistically insignificant.
For an AR(2) model, we get â1 = −0.7646 and â2 = −0.4401; both p-values are
practically zero. The value of the Portmanteau test is 14.5 with p-value 0.9177,
still great. So it is an ARIMA(2,1,0) model for the original series. For an AR(3)
model, the senior coefficient is insignificant; for AR(1), the residuals fail the white
noise test. As a result, we get four ‘good’ models: AR(3), AR(2), ARIMA(2,1,0)
and ARIMA(0,1,1). To choose one of them, we drop six points from the series,
construct a prediction and compare it with the actual data. The results are shown
in Figures 39, 40, 41 and 42. The ARIMA(2,1,0) prediction looks slightly better.
[Figures: increments of the series with their sample ACF and PACF]
the change of the level as a trend, or is it long-periodic-cycle behavior?
The first thirty values of its ACF (Figure 16) are non-negative; this might be
evidence of a trend. The Kendall and Spearman tests are also positive. However, its
PACF (Figure 17) suggests an AR(2) model. Indeed, an autoregression process of
second order may contain a long periodic cycle. Trying an AR(2) model, we get an
estimate of the mean m̂ = 17.06 with p-value practically zero (no surprise)
and estimates â1 = 0.4263 and â2 = 0.2576; both p-values are practically zero.
So, all the parameters of the model are statistically significant, and the value of the
[Figures: predictions compared with the actual data]
Portmanteau test is 26.97, which is quite alright for a χ² distribution with 22 degrees
of freedom (p-value 0.2573). So, it is a ‘good’ model. If we try an AR(1) model
instead, all parameters are statistically significant, but the value of the Portmanteau
test jumps up to 47.44, p-value 0.003. The Portmanteau test thus indicates
that the residuals have non-trivial correlations, so this is not a ‘good’ model. No
surprise: cyclic behavior is related to complex roots, and complex roots may show
up only if the order of the autoregression is at least 2. For an AR(3) model, the value of the
[Figures: one more prediction plot; the Crops data, years 1500-1850]
Portmanteau test is 26.14 with p-value 0.2456, practically the same as for the AR(2)
model. However, the estimate â3 = 0.0808 has p-value 0.2719, and that is the senior
coefficient. So, we have clear evidence of over-parametrization.
If instead we treat the data as a non-stationary series, we switch to the incre-
ments of the series (Figure 18). Looking at the ACF of the increments (Figure 19),
we may want to try an MA(1) model. The PACF (Figure 20) suggests that AR(6) might
work. For an MA(1) model, we get b̂1 = 0.701 with p-value 0.0023, so the parameter
is statistically significant. As for the residuals, we get 29.14 for the Portmanteau
test; the p-value is 0.1879. So, we’ve got yet another good model, ARIMA(0,1,1),
for the original series. The AR(6) model is also good, and so is the AR(5) model.
However, AR(6) has a much better value of the Portmanteau test (p-value 0.7346
instead of 0.1864). So, we have three models to compare: AR(2), ARIMA(6,1,0) and
ARIMA(0,1,1). Of them, ARIMA(6,1,0) produces a somewhat better prediction.
5. To conclude, consider the Price index data (Figure 43). It is surely non-
stationary. Moreover, the magnitude of the oscillations looks proportional to the
level. For this reason, we take the logarithm (Figure 44) and, after that, switch to
the increments (Figure 45).
The sample ACF of the increments (Figure 46) suggests an MA(3) model. The model
looks fine: the ACF of the residuals (Figure 47) does not show any significant
correlations. However, the senior coefficient, b̂3 = 0.22, has a p-value of 0.157, which
is way too big. So this coefficient probably can be dropped from the model.
[Figures: increments of the series, years 1550-1850, and a sample ACF]
However, the MA(2) model does not look good at all. The residuals show some non-
trivial correlations (at lags 3 and 8, see Figure 48), and the value of the Portmanteau
test with m = 25 is 54, which is way too much for the χ² distribution with 23 degrees
of freedom (p-value 0.0003).
Looking at the PACF of the data instead (Figure 49), we may decide to try an
autoregression of order 4. However, the value of the Portmanteau test is 39.6,
the p-value is 0.008, and the residuals do not look like a white noise (Figure 50).
Looking at the graph of the PACF once again, we may decide to try autore-
gression of the order 8. The coefficients of the model look fine, the residuals do not
[Figures: sample ACF and PACF plots, lags 1-30]
show any significant correlations (Figure 51), and the value of the Portmanteau test is
21 with p-value 0.202. But eight parameters to estimate looks like too many.
Trying mixed models, we come across an ARMA(2,1) model. Its residuals do not
show any significant correlations (Figure 52), the value 28 of the Portmanteau test
is reasonable (p-value 0.18), and all the coefficients have reasonable standard errors
(the autoregression coefficients are a1 = 0.79 with standard error 0.15 and a2 = −0.33
with standard error 0.06, and the moving average coefficient is b1 = 0.84 with standard
error 0.06).
[Figures: sample ACF plots, lags 1-30]
By the way, note that the Akaike Information Criterion (AIC) assigns the best
value to yet another model, ARMA(8,1). Since both AR(8) and ARMA(2,1) look good,
adding extra parameters to them is hardly reasonable. Indeed, when we
look at the ACF of the residuals (Figure 53), we see that the first five or six values
are practically zeros, which may indicate that the model is over-parameterized.
As above, we can compare the predictions made according to the models.
They are shown in Figures 54, 55 and 56. It is not easy to choose; how-
ever, the predictions according to the ARIMA(2,1,1) model and the ARIMA(8,1,1) model
[Figures: a sample ACF; predictions for the Crops data, years 1850-1870]
(the over-parameterized one) look slightly better than the prediction according to
the ARIMA(8,1,0) model.
Exercises
1. An ARMA(3,1) model has been fitted to some data set. All coefficients of
the model are significant. The values of the residual ACF will be posted on the
web. The sample size is N = 400. Are the residuals a white noise? What would
you say if N = 1600 instead?
CHAPTER 4
where ωi are some constants, Ai are independent random variables with E Ai² = Fi²,
and θi are i.i.d., independent of all Aj , and uniformly distributed on [0, 2π]. It
can be shown that the stochastic process Xt is strictly stationary. Let us compute
its auto-covariance function. We have
   R(k) = E Xt Xt+k = Σ_{i,j} E(Ai Aj) E[cos(ωi t + θi) cos(ωj(t + k) + θj)]
        = Σ_{i,j} E(Ai Aj) · (1/2) [E cos(ωi t + ωj(t + k) + θi + θj)
            + E cos(ωi t − ωj(t + k) + θi − θj)]
Now,

   E cos(ωi t + ωj(t + k) + θi + θj) = E cos(ωi t − ωj(t + k) + θi − θj) = 0

if i ≠ j (why?). If i = j, then the first expectation is still zero and the second one
reduces to

   E cos(ωi t − ωi(t + k) + θi − θi) = cos(−ωi k) = cos(ωi k)

Hence, the autocovariance function turns out to be equal to

(1.2)   R(k) = (1/2) Σ_{i=1}^{n} Fi² cos(ωi k)
In particular, the total variance σX² = R(0) of the process can be decomposed into the
sum of the variances of the corresponding harmonics:

   σX² = (1/2) Σ_{i=1}^{n} Fi²

Switching to the autocorrelation function, we get

(1.3)   ρ(k) = (1/σX²) · (1/2) Σ_{i=1}^{n} Fi² cos(ωi k)
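Formula (1.2) can be verified by brute force, averaging Xt Xt+k over many independent draws of the Ai and θi (the frequencies, variances, and names below are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(5)
omegas = np.array([0.5, 1.3])        # the omega_i
F2 = np.array([1.0, 2.0])            # F_i^2 = E A_i^2
reps, t, k = 200_000, 3, 4

A = rng.standard_normal((reps, 2)) * np.sqrt(F2)       # E A_i^2 = F_i^2
theta = rng.uniform(0, 2 * np.pi, (reps, 2))

def X(s):
    # one value of the harmonic process for each of the `reps` draws
    return np.sum(A * np.cos(omegas * s + theta), axis=1)

emp = np.mean(X(t) * X(t + k))                         # ensemble average
theory = 0.5 * np.sum(F2 * np.cos(omegas * k))         # formula (1.2)
print(emp, theory)
```

Note that the average does not depend on t, only on the lag k, as stationarity requires.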
In a sense, this example is not typical. For instance, the autocorrelation func-
tion ρ(k) does not tend to zero as k → ∞ (and that was a necessary condition
for most of the methods discussed earlier). Nonetheless, we’d like to have a way to
study the frequency structure of a stationary process. The following theorem holds.
Theorem (Wold’s representation). Suppose the autocovariance function R(k)
of the process Xt satisfies the condition Σ_{k=1}^{∞} |R(k)| < ∞. Then there exists a non-
negative continuous even function fX(ω), called the power spectral density, or just
spectral density, of the process Xt , such that

(1.4)   R(k) = ∫_{−π}^{π} cos(ωk) fX(ω) dω

The function fX(ω) can be recovered from the autocovariance function R(k) by the
formula

(1.5)   fX(ω) = (1/2π) [R(0) + 2 Σ_{k=1}^{∞} R(k) cos(ωk)]

The spectral density has the property

   ∫_{−π}^{π} fX(ω) dω = σX²

(just set k = 0 in (1.4)). We will drop the index X whenever it causes no confusion.
Sometimes it is convenient to normalize the spectral density by dividing by the
variance of the process. The normalized spectral density f*X(ω) = (1/σX²) fX(ω) is
related to the autocorrelation function:

(1.6)   ρ(k) = ∫_{−π}^{π} cos(ωk) f*X(ω) dω,
        f*X(ω) = (1/2π) [1 + 2 Σ_{k=1}^{∞} ρ(k) cos(ωk)]

For the normalized spectral density, we have

   ∫_{−π}^{π} f*(ω) dω = 1
We may consider (1.4) as a continuous version of (1.2) (instead of a finite
(or countable) sum, we have an integral). In fact, (1.1) is an example of a stationary
process with a so-called discrete spectrum, and the above theorem describes processes
with continuous spectra. There exists a version of the theorem that handles
discrete and continuous spectra at the same time, but it requires the use of the
Lebesgue-Stieltjes integral.
Remarks. 1. Why do we limit ourselves to the interval [−π, π]? Let us return
to the example (1.1). Since

   cos(ωk + φ) = cos((ω + 2π)k + φ)

for all integer k, it is clear that we could assume that all the frequencies ωi belong
to, say, the interval (−π, π] (since the observations are discrete, we can’t distinguish
oscillations with frequencies that differ by a multiple of 2π).
2. Alternative versions of (1.4), (1.5) and (1.6). Recall that

   cos(ωk) = (e^{iωk} + e^{−iωk})/2

and

   e^{iωk} = cos(ωk) + i sin(ωk)
In a similar way,

(1.9)   f*X(ω) = (1/2π) Σ_{k=−∞}^{∞} ρ(k) e^{iωk}

and

(1.10)   ρ(k) = ∫_{−π}^{π} e^{ikω} f*X(ω) dω.

In the language of real and complex analysis, the spectral density is the Fourier
transform of the autocovariance function.
3. Since the spectral density is an even function, it looks reasonable to consider
only non-negative frequencies. However, the concept of a spectral density can be
extended to the multivariate case, when we observe two (or more) series and have
to work with the cross-covariance function CXY(k) = Cov(Xt , Yt+k) and the so-called
cross-spectrum. Reduction to non-negative frequencies does not work there.
4. There exists a representation of a Gaussian stationary process with zero
expectation as a mixture of random harmonics, which can be considered a
continuous version of the representation (1.1). It is based on the concept of a
stochastic integral. We do not discuss it here.
Spectra of some stationary processes.
1. White noise. Suppose the Xt are i.i.d. Then ρ(k) = 0 for all k ≠ 0 and (1.5)
becomes

   fX(ω) = σX²/(2π)

So, the spectral density of the white noise is constant: all frequencies participate
with the same magnitude (just as a uniform mixture of all colors produces the white
color).
2. MA(1). Suppose now that

   Xt = εt + bεt−1

where εt is a white noise and |b| < 1. We have

   R(0) = (1 + b²)σε² ,   R(1) = R(−1) = bσε²

and R(k) = 0 for k = ±2, ±3, . . . . Therefore

(1.11)   fX(ω) = (1/2π) σε² (1 + 2b cos(ω) + b²)
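As a sanity check, (1.11) is exactly (1.5) with only two nonzero covariance terms, and its integral over [−π, π] must return R(0) = (1 + b²)σε². Numerically (parameter values ours):

```python
import numpy as np

b, sig2 = 0.5, 1.0
f = lambda w: sig2 * (1 + 2 * b * np.cos(w) + b**2) / (2 * np.pi)   # (1.11)

# equispaced average over one full period (exact for trigonometric polynomials)
w = np.linspace(-np.pi, np.pi, 4097)[:-1]
integral = f(w).mean() * 2 * np.pi
print(integral, (1 + b**2) * sig2)   # both equal 1.25
```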
3. AR(1). Suppose

   Xt = aXt−1 + εt

where εt is a white noise and |a| < 1. It is easier to begin with the normalized
spectral density. As we know, ρ(k) = a^{|k|}. Hence

   f*X(ω) = (1/2π) (1 + 2 Σ_{k=1}^{∞} a^k cos(ωk))

In order to find a closed form, we use the alternative formula (1.9). We have

   f*X(ω) = (1/2π) Σ_{k=−∞}^{∞} a^{|k|} e^{iωk}
          = (1/2π) (1 + Σ_{k=1}^{∞} (ae^{iω})^k + Σ_{k=1}^{∞} (ae^{−iω})^k)
          = (1/2π) (1 + ae^{iω}/(1 − ae^{iω}) + ae^{−iω}/(1 − ae^{−iω}))
          = (1/2π) · [(1 − ae^{iω})(1 − ae^{−iω}) + ae^{iω}(1 − ae^{−iω}) + ae^{−iω}(1 − ae^{iω})] / [(1 − ae^{iω})(1 − ae^{−iω})]
          = (1/2π) · (1 − ae^{iω} − ae^{−iω} + a² + ae^{iω} − a² + ae^{−iω} − a²) / (1 − ae^{iω} − ae^{−iω} + a²)
          = (1/2π) · (1 − a²)/(1 − 2a cos(ω) + a²)
Since
σX² = σε²/(1 − a²),
the non-normalized spectral density is given by
(1.12) fX(ω) = (1/2π) σε² / (1 − 2a cos(ω) + a²)
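As a numerical sanity check, the closed form (1.12) can be compared with a truncated version of the defining sum, using R(k) = σX² a^{|k|} and σX² = σε²/(1 − a²). A minimal sketch in plain Python (the function names are ours):

```python
import math

def ar1_spectrum(omega, a, sigma_eps2=1.0):
    # closed form (1.12)
    return sigma_eps2 / (2 * math.pi * (1 - 2 * a * math.cos(omega) + a * a))

def ar1_spectrum_series(omega, a, sigma_eps2=1.0, terms=200):
    # truncated sum (1/2pi)(R(0) + 2 sum_k R(k) cos(k omega)),
    # with R(k) = sigma_X^2 a^k and sigma_X^2 = sigma_eps2 / (1 - a^2)
    sigma_x2 = sigma_eps2 / (1 - a * a)
    s = sigma_x2
    for k in range(1, terms):
        s += 2 * sigma_x2 * a**k * math.cos(k * omega)
    return s / (2 * math.pi)
```

Since |a| < 1, the tail of the truncated sum decays geometrically, so 200 terms are far more than enough for plotting purposes.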
This straightforward approach does not work for a general ARMA(k, l) process
(there is no simple formula for the autocovariance function). We need to find another
way to do it, and the following general result is very helpful. Suppose that two
stationary processes Xt and Yt with zero expectation are related to each other by
the equation
(1.13) Yt = Σ_{n=−N}^{N} gn Xt−n
[Six figure panels: plots of the spectra discussed in this section (Spectrum vs. Frequency, 0 to π; the last panel on a logarithmic scale).]
we have
|α(eiω )|2 = 1.41 + 0.6 cos ω − 0.8 cos(2ω),
which again results in two peaks; however, this time the peak at zero is much smaller
than the peak at π and can only be seen on a logarithmic scale (Figure 6).
Finally, consider a process
Xt = 1.8Xt−1 − 0.9Xt−2 + εt .
This time, (1.17) implies
|α(eiω )|2 = 5.05 − 6.84 cos ω + 1.8 cos(2ω).
and therefore
fX(ω) = (1/2π) · 1/(5.05 − 6.84 cos ω + 1.8 cos(2ω))
The graph is shown in Figure 7. Recall that this AR(2) process has a clearly
visible quasi-periodic behavior (Figure II.21), which is reflected by the spectrum.
Comments and Explanation. Let us begin with MA(1) and AR(1) processes. As above, assume σε² = 1. The spectrum of the MA(1) process
Xt = εt + bεt−1
is given by the formula
fX(ω) = (1/2π)(1 + 2b cos(ω) + b²)
[Figure panel: a spectrum plotted against frequency, π/10 to π.]
and therefore the spectrum of the process equals, up to a factor 1/(2π), the
product of the spectra of two AR(1) processes, one of them corresponding to the
root z1 and the other one to the root z2. So, if both roots are positive, then
the spectrum has a peak at zero, and if both roots are negative, the spectrum has a
peak at π. If the roots are of opposite signs, we have two peaks, one at zero
and the other one at π. Since the size of a peak depends on the absolute value of
the corresponding parameter, one of the peaks might look insignificant compared
to the other one. Say, for the process
Xt = 0.1Xt−1 + 0.6Xt−2 + εt ,
we have z1 ≈ 0.826, z2 ≈ −0.726, so we have two roots of opposite sign. Since
|z1 | > |z2 |, the peak at zero has to be bigger than the peak at π (Figure 5).
For the equation
Xt = −0.5Xt−1 + 0.4Xt−2 + εt
we have z1 = 0.43, z2 = −0.93. Again, we have two roots of opposite signs. How-
ever, this time the negative root z2 is close to −1 and the positive root z1 is not
close to 1 at all. For that reason, the peak at zero is very small compared to that
at π (Figure 6).
In the case of complex roots, an AR(2) process has a quasi-periodic behavior and the
spectrum has a peak at a certain frequency. Namely, if z² + a1 z + a2 has a pair of
complex roots z1,2 = re^{±iθ}, then the spectrum has a peak at a frequency ωmax that
is close to θ unless r is very small or θ is close to 0 or to π. Specifically, r and θ
should satisfy the condition
(1.18) |cos(θ)| < 2r/(1 + r²).
Under this condition, ωmax could be found from the equation
(1.19) cos(ωmax) = ((1 + r²)/2r) cos(θ).
(see details below). The value of the spectrum at ωmax equals
(1.20) (1/2π) · 1/((1 − r²)² sin²(θ))
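Formulas (1.19) and (1.20) are easy to replay numerically: build the AR(2) spectrum from r and θ via (1.21), evaluate it at the frequency given by (1.19), and compare with (1.20). A sketch in plain Python (the parameter values below are ours):

```python
import math

def ar2_spectrum(omega, r, theta, sigma_eps2=1.0):
    # X_t + a1 X_{t-1} + a2 X_{t-2} = eps_t with roots r e^{+-i theta},
    # so a1 = -2 r cos(theta) and a2 = r^2, as in (1.21)
    a1, a2 = -2 * r * math.cos(theta), r * r
    denom = (1 + a1 * a1 + a2 * a2
             + 2 * a1 * (1 + a2) * math.cos(omega)
             + 2 * a2 * math.cos(2 * omega))
    return sigma_eps2 / (2 * math.pi * denom)

r, theta = 0.9, 1.0                                 # satisfies (1.18)
w_max = math.acos((1 + r * r) / (2 * r) * math.cos(theta))          # (1.19)
peak = 1 / (2 * math.pi * (1 - r * r) ** 2 * math.sin(theta) ** 2)  # (1.20)
```

A grid search over (0, π) confirms that no frequency gives a larger spectrum value than w_max.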
Finally, let us discuss AR(k) and ARMA(k, l) processes. Consider, for instance,
an AR(k) process
α(B)Xt = εt
and suppose that the equation
z^k + a1 z^{k−1} + · · · + ak = 0
has k roots z1, z2, . . . , zk, some of which may be complex. All of the roots should
satisfy the condition |zi| < 1. Since the roots of the polynomial α are reciprocal to
z1, z2, . . . , zk, the polynomial can be factorized as
α(z) = (1 − z1 z) . . . (1 − zk z)
Since complex roots come in pairs (a complex root and its conjugate), we can
therefore represent α(z) as a product
α(z) = α1(z) α2(z) . . . αm(z)
where each polynomial αi is either linear or quadratic; linear factors have the form
(1 − zj z) and correspond to real roots zj, and quadratic factors have the form
(1 − zj z)(1 − z̄j z) and correspond to a pair of complex roots zj, z̄j. But then
|α(e^{iω})|² = |α1(e^{iω})|² |α2(e^{iω})|² . . . |αm(e^{iω})|²
and therefore the spectrum of the AR(k) process equals, up to a power of 1/(2π),
the product of the spectra of the corresponding AR(1) and AR(2) processes. Therefore,
to every positive root there corresponds a peak at zero, to every negative root there
corresponds a peak at π, and to every pair of complex roots there may correspond
a peak at the corresponding frequency. The height and sharpness of the peaks depend
on how big the roots are in absolute value. Roots with absolute value very close to
one produce very sharp and tall peaks, and the other peaks may look
insignificant. Similar arguments work for ARMA(k, l) processes. In fact, this way
we can approximate any positive continuous function.
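In practice there is no need to carry out the factorization by hand: |α(e^{iω})|² can be evaluated directly from the coefficients. A minimal sketch in plain Python (function name is ours), checked against the closed form obtained above for Xt = 1.8Xt−1 − 0.9Xt−2 + εt:

```python
import cmath, math

def ar_spectrum(omega, coefs, sigma_eps2=1.0):
    # spectrum of X_t = coefs[0] X_{t-1} + coefs[1] X_{t-2} + ... + eps_t,
    # computed as sigma_eps2 / (2 pi |alpha(e^{i omega})|^2)
    z = cmath.exp(1j * omega)
    alpha = 1 - sum(a * z ** (k + 1) for k, a in enumerate(coefs))
    return sigma_eps2 / (2 * math.pi * abs(alpha) ** 2)
```

For the AR(2) example above this reproduces 1/(2π(5.05 − 6.84 cos ω + 1.8 cos 2ω)) to machine precision.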
Therefore
fY(ω) = (1/2π) Σ_{k=−∞}^{∞} e^{iωk} Σ_n Σ_m gn gm RX(k + n − m)
= (1/2π) Σ_n Σ_m gn gm Σ_{k=−∞}^{∞} e^{iωk} RX(k + n − m)
and
G(e^{iω}) G(e^{−iω}) = |G(e^{iω})|².
2. Verification of (1.19) and (1.20) (sketch). Let
Xt + a1 Xt−1 + a2 Xt−2 = εt
be an AR(2) process. Suppose the equation
z 2 + a1 z + a2 = 0
has a pair of complex roots z1,2 = re±iθ . We have r < 1 from the stationarity
condition. Also,
z 2 + a1 z + a2 = (z − z1 )(z − z2 )
which implies
(1.21) a1 = −2r cos(θ), a2 = r2 .
Assume that (1.18) is satisfied. Then ωmax exists and | cos(ωmax )| < 1. Now,
according to (1.17), the spectrum reaches its maximum when
(1.22) 2a1 (1 + a2 ) cos(ω) + 2a2 cos(2ω)
reaches its minimum. With help of (1.21), this becomes
(1.23) − 4r(1 + r2 ) cos(θ) cos(ω) + 2r2 cos(2ω).
Setting the derivative to zero, we get an equation
4r(1 + r2 ) cos(θ) sin(ω) − 4r2 sin(2ω) = 0
which is equivalent to the equation
sin(ω)((1 + r2 ) cos(θ) − 2r cos(ω)) = 0
So, we have three solutions to the equation: 0, π and the one given by (1.19).
However, the first two solutions correspond to local maxima (second derivative
turns negative). Indeed, second derivative of (1.23) is equal to
4r(1 + r2 ) cos(θ) cos(ω) − 8r2 cos(2ω)
and its values at zero and at π are given by the formula
±4r(1 + r2 ) cos(θ) − 8r2 = 4r(±(1 + r2 ) cos(θ) − 2r) < 0
Exercises
1. Verify (1.17).
For Problems 2-12, compute and graph the spectrum of the corresponding sta-
tionary processes (simplify your answer down to real numbers, things like eiωk are
not allowed; you may find (1.16) and (1.17) helpful). Assume σε2 = 1.
2. Xt = 0.7Xt−1 + εt
3. Xt = 0.9Xt−1 + εt + 0.9εt−1
4. Xt = 1.35Xt−1 − 0.66Xt−2 + εt
5. Xt = 0.3Xt−1 + 0.54Xt−2 + εt
6. Xt = −0.05Xt−1 + 0.9Xt−2 + εt
7. Xt = 1.6Xt−1 − 0.94Xt−2 + εt
8. Xt = 1.7Xt−1 − 0.8Xt−2 + εt − 1.9εt−1 + 0.95εt−2
9*. (1 − 1.4B + 0.75B 2 )(1 + 0.9B)Xt = εt
10*. (1 − 1.9B + 0.95B 2 )(1 + 1.7B + 0.9B 2 )Xt = εt
11**. (1 + 1.35B + 1.175B 2 + 0.6375B 3 )Xt = εt
12**. (1 + 0.028B + 1.06B 2 − 0.11B 3 + 0.21B 4 )Xt = (1 + 0.24B + 0.99B 2 )εt
as an estimate for it. The function IX plays a central role in the spectral estimation
though it is not a good estimate by itself. It is called the periodogram of the time
series X.
To simplify our computations, we assume that the expectation of the series Xt
is known and is equal to zero (otherwise, we have to estimate it and subtract it
from the data). Also, we assume that the number of observations N is even (this
simplifies things a bit).
We begin with finding an alternative formula for the periodogram. It is convenient to set Xt = 0 for all t outside 1, . . . , N and R̂(k) = 0 for all k such that |k| ≥ N.
Then
R̂(k) = (1/N) Σ_{t=−∞}^{∞} Xt Xt+k
[Figure panel: a periodogram plotted against frequency, 0 to π.]
and we have
IX(ω) = (1/2π) Σ_{k=−∞}^{∞} R̂(k) e^{iωk}
= (1/2π)(1/N) Σ_{k=−∞}^{∞} (Σ_{t=−∞}^{∞} Xt Xt+k) e^{iωk}
(2.2) = (1/2π)(1/N) Σ_t Xt e^{−iωt} Σ_k Xt+k e^{iω(k+t)}
= (1/2π)(1/N) Σ_t Xt e^{−iωt} Σ_s Xs e^{iωs}
= (1/2π)(1/N) |Σ_{t=1}^{N} Xt e^{−iωt}|²
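The identity (2.2) holds exactly for any ω (not only for the principal frequencies), which makes it a convenient unit test. A sketch in plain Python (function names are ours):

```python
import cmath, math

def periodogram_dft(x, omega):
    # right-hand side of (2.2): (1/(2 pi N)) |sum_t X_t e^{-i omega t}|^2
    n = len(x)
    d = sum(x[t] * cmath.exp(-1j * omega * (t + 1)) for t in range(n))
    return abs(d) ** 2 / (2 * math.pi * n)

def periodogram_acvf(x, omega):
    # left-hand side: (1/2pi) sum_k Rhat(k) e^{i omega k}, with
    # Rhat(k) = (1/N) sum_t X_t X_{t+k} and X_t = 0 outside 1..N
    n = len(x)
    total = 0.0
    for k in range(-(n - 1), n):
        r = sum(x[t] * x[t + k]
                for t in range(max(0, -k), min(n, n - k))) / n
        total += r * math.cos(omega * k)  # imaginary parts cancel: Rhat is even
    return total / (2 * math.pi)
```

The two routines agree to floating-point accuracy for any input series.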
[Three figure panels: periodograms plotted against frequency, 0 to π.]
are independent. If p is not zero or N/2, then ζp has the (1/2)χ²(2) distribution. Otherwise, it has a χ²(1) distribution.
In particular, we see from here that the periodogram is an unbiased estimate for
the spectral density (at least for the principal frequencies). However, the variance
of the periodogram does not depend on N and does not go to zero as N → ∞.
Therefore the periodogram is not a consistent estimate.
White noise test. The established result allows us to construct a white noise
test. Consider
W = max_{p=1,...,N/2−1} I(ωp)/(σ̂X²/2π)
We have
P{W ≤ z} = Π_{p=1}^{N/2−1} P{ζp ≤ z} ≈ (1 − e^{−z})^{N/2−1}
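The approximate critical value for a given level follows by inverting this product formula. A sketch in plain Python (the helper name is ours):

```python
import math

def whitenoise_critical(n, level=0.05):
    # solve (1 - e^{-z})^(N/2 - 1) = 1 - level for z; observing W above this
    # value leads to rejecting the white noise hypothesis at the given level
    m = n // 2 - 1
    return -math.log(1 - (1 - level) ** (1 / m))
```

The critical value grows slowly (logarithmically) with the number of principal frequencies.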
Yt = Σ_{k=0}^{m} gk Xt−k
(1/√N) Σ_{t=1}^{N} Xt e^{iωt} ≈ (1/√N) Σ_{t=1}^{N} Xt−k e^{iω(t−k)}
(1/2πN) Σ_{t=1}^{N} Xt−k e^{iω(t−k)} Σ_{t=1}^{N} Xt−l e^{−iω(t−l)} ≈ IX(ω)
(so called inverse discrete Fourier transform). Also, since Xn is real-valued, FX(−k)
coincides with the complex conjugate of FX(k).
There exist efficient algorithms that allow us to compute FX (so called fast
Fourier transform).
In terms of discrete Fourier transform, the periodogram of a time series Xt , t =
1, . . . , N could be written as
IX(2πp/N) = (1/2π)(1/N) |FX(p)|², p = 0, 1, . . . , N/2
[Figure panels 12–14: periodograms of the sunspots data (Periods axis: 80, 11, 5.5, 3, 2; the second panel on a logarithmic scale) and a spectrum plotted against frequency on a logarithmic scale.]
where FX is the discrete Fourier transform of the sequence Xt . In fact, real and
imaginary parts of FX (p) are proportional to the coefficients ap and bp given by
(2.6).
Examples and Further Comments. 1. In Figures 12–14, you can see a periodogram of the sunspots data, and a periodogram of the series Xt = 0.9Xt−1 + εt together with its spectrum.
2. Sometimes, the periodogram reveals a presence of periodic components or
even the structure of the data. On Figure 15, you can see a mysterious data set.
Looking at its ACF and PACF, we may try different ARMA models (AR(4)? or
[Figure 15: the data set plotted against time, with its ACF and PACF (lags 1–30).]
maybe MA(3)? or something mixed?) but they don't really work. However, the
periodogram reveals the secret: it is a mixture of six periodic components (yes, just
that, with no noise).
3. Another example of this type was given in the introduction (human brain
example, Figures 7 and 8 in the Introduction). Though we can’t see them in the
noise and all other brain activity, the series contains two periodic components (one
is the main electric power frequency and the other one is a resonance).
Exercises
[Figure panel: a periodogram, Periods axis: 30, 10, 5, 4, 3, 2.]
1. For the following data set (will be posted on the web, 30 data points) com-
pute and graph the periodogram. Reminder: the periodogram should be computed
only for the principal frequencies. Suggestion: you may find the alternative formula
(2.2) to be the easiest.
2. For a series of length 400, the estimated variance is σ̂X² = 3.96 and the maximum
of the periodogram equals 4.91. Is the series a white noise? What if the maximal
value of the periodogram is 6.35? Use the 5% significance level.
is a linear combination of cosine functions. The bigger k is, the smaller is the period
of the corresponding component. Hence, IX oscillates heavily because too much
weight has been given to the highly oscillating components, the ones with large k.
For this reason, we may consider an estimate of the following structure
(3.1) f̂(ω) = (1/2π)(R̂(0) + 2 Σ_{k=1}^{N−1} λk R̂(k) cos(kω))
where the coefficients λk , called the lag window, should somehow decrease as k
grows. Typically, we look for the coefficients of the following structure:
(3.2) λk = λ(k/M) for 0 ≤ k ≤ M, and λk = 0 for k > M
The function λ(·) is called a window generator, and M is the truncation point (it is
often called the window width). The function λ(x) should be decreasing, continuous
at zero, such that λ(0) = 1, λ(1) = 0. Note that λk → 1 as M → ∞. So, it looks
like the parameter M should go to infinity as N → ∞, otherwise we can’t achieve
the consistency.
which is similar to (3.3), with no agreement about extending the integration to the
whole real line (so we have sort of end-effects near −π and π).
The other way around, suppose that fˆ is defined by (3.3) with the spectral
window w(ω). Then
(3.7) f̂(ω) = (1/2π)(R̂(0) + 2 Σ_{k=1}^{N−1} λk cos(kω) R̂(k))
where
(3.8) λk = ∫_{−∞}^{∞} w(z) cos(kz) dz.
(see details below).
In fact, if the spectral window w is generated by the window generator W (ω),
then the corresponding lag window is generated by the lag window generator
λ(x) = ∫_{−∞}^{∞} W(z) e^{ixz} dz
with the same parameter M. Conversely, the inverse formula for the Fourier transform implies that
W(z) = (1/2π) ∫_{−∞}^{∞} λ(x) e^{−ixz} dx.
However, we can’t claim that the lag window generator vanishes outside of the
interval [−1, 1]; the other way around, if the lag window generator vanishes outside
[−1, 1], then the corresponding spectral window generator does not have compact
support. For this reason, these two approaches are complimentary to each other.
Since the spectral density is always non-negative, it is natural to require that its
estimate is also non-negative. This explains a certain advantage of the spectral
windows: if w(z) is non-negative, then the estimate f̂(ω) is automatically non-negative
(since the periodogram is non-negative, we are integrating a non-negative function
in (3.3)). A condition on the lag window which guarantees non-negativity of the
estimate is more tricky and less natural (essentially, it says that the corresponding
spectral window is non-negative).
Many dozens of windows have been discussed. A few examples.
1. Truncated periodogram. It is defined by the lag window
λk = 1 if |k| ≤ M, and λk = 0 otherwise.
The corresponding spectral window is equal to
w(θ) = (1/2π) sin((M + 1/2)θ)/sin(θ/2)
(it is called the Dirichlet kernel; it is not non-negative, so the corresponding estimate may be negative somewhere).
2. Bartlett, or triangular, window. Again, a lag window, given by the formula
(3.9) λk = 1 − |k|/M if |k| ≤ M, and λk = 0 otherwise.
It corresponds to a non-negative spectral window
(3.10) w(θ) = (1/2πM)(sin(Mθ/2)/sin(θ/2))²
(the so-called Fejér kernel of order M).
3. Rectangular, or Daniell, window. This is a spectral window of the form
w(θ) = M/(2π) if |θ| ≤ π/M, and w(θ) = 0 otherwise
where 0 < a ≤ 1 is a parameter, typically set to 0.25 (so called Tukey-Hanning win-
dow). The corresponding spectral window is a linear combination of the Dirichlet
kernels (see example 1) and it is not non-negative.
5. Parzen window. Again, a lag window, with the window generator given
by the formula
λ(x) = 1 − 6x² + 6|x|³ for |x| ≤ 1/2; λ(x) = 2(1 − |x|)³ for 1/2 ≤ |x| ≤ 1; λ(x) = 0 for |x| > 1.
The weights λk could be found from (3.2). The function λ(x) is twice differentiable
(as we’ll see, this is an advantage) and the corresponding spectral window is non-
negative.
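A sketch of the Parzen generator and the resulting lag-window estimate (3.1)–(3.2), in plain Python (the function names are ours; rhat is assumed to hold R̂(0), R̂(1), . . .):

```python
import math

def parzen(x):
    # Parzen window generator lambda(x)
    ax = abs(x)
    if ax <= 0.5:
        return 1 - 6 * ax ** 2 + 6 * ax ** 3
    if ax <= 1:
        return 2 * (1 - ax) ** 3
    return 0.0

def lag_window_estimate(rhat, omega, m, gen=parzen):
    # estimate (3.1) with weights lambda_k = gen(k/M) as in (3.2)
    s = rhat[0]
    for k in range(1, min(m, len(rhat) - 1) + 1):
        s += 2 * gen(k / m) * rhat[k] * math.cos(k * omega)
    return s / (2 * math.pi)
```

Note that the two branches of the generator match at |x| = 1/2, and λ(0) = 1, λ(1) = 0, as required of a window generator.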
6. Bartlett-Priestley window. This is a spectral window with the window generator
W(θ) = (3/4π)(1 − θ²/π²) if |θ| ≤ π, and W(θ) = 0 otherwise
(remember, it has to be re-scaled with the window width M).
How can we compare the windows, that is, how can we compare the corre-
sponding estimates? Also, how should we choose the window width M ? In order
to be able to say something about it, we have to study the bias and the variance
of the estimates.
Since the lag window and the spectral window approaches are (nearly) equiva-
lent, we consider an estimate (3.3) with the spectral window generator W (ω). We
assume that it is even and non-negative and it satisfies the condition
∫ W(θ) dθ = 1
Also, we will assume that the spectral density of the process fX (ω) is twice con-
tinuously differentiable. If this is the case, one can show that the bias b(ω) of the
estimate is approximately equal to
(3.11) b(ω) = E f̂(ω) − f(ω) ≈ (f″X(ω)/2M²) ∫_{−π}^{π} θ² W(θ) dθ
[Figure panel: a stretch of the AR(4) data, observations 600–750.]
The variance of the estimate, v²(ω), could be found from the formula
(3.12) v²(ω) = Var f̂(ω) ≈ 2π fX²(ω) (M/N) ∫_{−π}^{π} W²(θ) dθ
[Figure panels: the periodogram of the AR(4) data and two spectral estimates (Periods axis: 20, 9.05, 5, 3, 2).]
Figure 22. The same thing with M = 100. Now we can see both
peaks (though the peak on the right is not as tall as it should be).
where
ν = N / (πM ∫ W²(ω) dω)
Figure 23. The same thing with M = 250. Peaks are perfect,
but the estimate is no longer a smooth function, so M is probably
a bit too large.
[Figure panels: spectral estimates on a logarithmic scale (Periods axes).]
Figure 26. The same thing with M = 100 looks much better.
describe its properties. However, the smaller the first of them is, the bigger is the
second, and the other way around. Indeed, since ∫ W(θ) dθ = 1, the first of the
integrals is small if W is concentrated near zero. But then W must be large in
a neighborhood of zero and the integral of W² must also be large. The following
table confirms this conclusion.
window               bias × M²/f″(ω)     variance × N/[M f²(ω)]
Daniell              π²/6 ≈ 1.645        1
Parzen               6                   0.539285
Tukey                π²/4 ≈ 2.467        3/4
Bartlett-Priestley   π²/10 ≈ 0.987       6/5
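The Daniell and Bartlett-Priestley rows of this table can be reproduced by direct numerical integration of the window generators. A sketch in plain Python (the quadrature is a crude midpoint rule; function names are ours):

```python
import math

def integral(f, a, b, n=20000):
    # midpoint rule; accurate enough for these smooth, compactly supported integrands
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

def daniell(t):
    return 1 / (2 * math.pi) if abs(t) <= math.pi else 0.0

def bartlett_priestley(t):
    return (3 / (4 * math.pi)) * (1 - t * t / math.pi ** 2) if abs(t) <= math.pi else 0.0

def bias_coef(W):
    # bias * M^2 / f''(omega), i.e. (1/2) int theta^2 W(theta) d theta, from (3.11)
    return 0.5 * integral(lambda t: t * t * W(t), -math.pi, math.pi)

def var_coef(W):
    # variance * N / (M f^2(omega)), i.e. 2 pi int W^2, from (3.12)
    return 2 * math.pi * integral(lambda t: W(t) ** 2, -math.pi, math.pi)
```

The Parzen and Tukey rows would require the spectral windows corresponding to their lag window generators, which we do not reproduce here.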
4. Bandwidth. A typical goal of the spectral analysis is an interpretation
and/or explanation of the peaks/troughs in the spectrum. Hence, the goal of spec-
tral estimation is to estimate the shape of the spectrum rather than estimate a
value for any particular frequency.
As we see from (3.11), the bias of the estimate is proportional to the second
derivative of the spectral density. If the spectrum is nearly flat, the bias is small,
if it has sharp peaks or narrow troughs, then the bias is large at those points.
Suppose the spectral density has a peak at the frequency ω0. Consider the
points ω1 < ω0 < ω2 such that f(ω1) = f(ω2) = (1/2) f(ω0) (so-called half-power
points). The distance Bh(ω0) = ω2 − ω1 is called the spectral bandwidth of the
[Figure panels: a peak and a trough of a spectrum, with the half-power points ω1 and ω2 marked on the frequency axis.]
What happens to the spectral density when we estimate it by using any particular window? We begin with an example. Let w(ω) be the Daniell, or rectangular,
window w(ω) = M/(2π) on (−π/M, π/M). The estimate f̂(ω) is then the average of the
periodogram over an interval of width 2π/M. Assuming that the expectation of
the periodogram is approximately equal to the spectral density, and ignoring the
we get
f″(ω) = −(1/π) Σ_{k=1}^{∞} k² R(k) cos(kω).
So, if R(k) decays rapidly, f 00 is relatively small and the bandwidth is large. Oth-
erwise, bandwidth is small and the spectrum contains sharp peaks and/or narrow
troughs.
Details. 1. Derivation of (3.7). Indeed, (1.4) implies that
R̂(k) = ∫_{−π}^{π} IX(θ) cos(kθ) dθ = ∫_{−π}^{π} IX(θ) e^{iθk} dθ
Therefore
f̂(ω) = (1/2π)(R̂(0) + 2 Σ_{k=1}^{N−1} λk R̂(k) cos(kω))
= (1/2π) Σ_{k=−(N−1)}^{N−1} λk R̂(k) e^{−iωk}
= (1/2π) Σ_{k=−(N−1)}^{N−1} λk e^{−iωk} ∫_{−π}^{π} IX(θ) e^{iθk} dθ
= ∫_{−π}^{π} [(1/2π) Σ_{k=−(N−1)}^{N−1} λk e^{−i(ω−θ)k}] IX(θ) dθ
= ∫_{−π}^{π} w(ω − θ) IX(θ) dθ
where
λk = ∫_{−∞}^{∞} w(z) e^{ikz} dz.
Since w is even,
∫_{−∞}^{∞} w(z) sin(kz) dz = 0
and therefore
λk = λ−k = ∫_{−∞}^{∞} w(z) cos(kz) dz.
Also,
λ0 = ∫_{−∞}^{∞} w(z) dz = 1
as promised.
3. Derivation of (3.11) and (3.12) (sketch). To simplify the presentation,
we assume that the expectation of the series is equal to zero and this is known to
us. We begin with the expectation of the periodogram IX (ω). By (III.2.4), we have
E R̂(k) = (1 − |k|/N) R(k) = (1 − |k|/N) ∫_{−π}^{π} e^{ikθ} fX(θ) dθ
and therefore
E IX(ω) = (1/2π) Σ_{k=−(N−1)}^{N−1} E R̂(k) e^{−ikω}
= (1/2π) ∫_{−π}^{π} Σ_{k=−(N−1)}^{N−1} e^{−ik(ω−θ)} (1 − |k|/N) fX(θ) dθ
Now,
fX(ω − φ) ≈ fX(ω) − φ f′X(ω) + (φ²/2) f″X(ω)
Hence the bias
b(ω) = E f̂X(ω) − fX(ω) = ∫_{−π}^{π} fX(ω − φ) (FN ⋆ wN)(φ) dφ − fX(ω)
= ∫_{−π}^{π} (fX(ω − φ) − fX(ω)) (FN ⋆ wN)(φ) dφ
≈ ∫_{−π}^{π} (−φ f′X(ω) + (φ²/2) f″X(ω)) (FN ⋆ wN)(φ) dφ
(since both kernels FN and wN are even). Also, it could be shown that
∫_{−π}^{π} φ² (FN ⋆ wN)(φ) dφ ≈ (1/M²) ∫_{−π}^{π} φ² W(φ) dφ
which can be evaluated if we do know the covariance of IX (θ1 ) and IX (θ2 ). Unfor-
tunately, corresponding derivations take several pages.
Exercises
1. For the following data set (will be posted on the web, 100 data points),
estimate and graph the spectral density (a) using the truncated periodogram with
M = 10; (b) triangular window with M = 10; (c) Parzen window with M = 10.
2. Repeat the same with M = 20. Which of the estimates (out of both
problems) looks more reasonable to you?
4. Estimation Details
Precision of the estimate and comparison of the windows. One of
the possible measures of the precision of the estimate is called the mean square
percentage error. It is defined by the formula
(4.1) η²(ω) = E[{f̂(ω) − f(ω)}²]/f²(ω) = {v²(ω) + b²(ω)}/f²(ω)
Denote
IW = 2π ∫_{−∞}^{∞} W²(ω) dω
and
BW = (12 ∫_{−∞}^{∞} ω² W(ω) dω)^{1/2}
According to (3.11) and (3.12),
v²(ω) = fX²(ω) (M/N) IW
and
b(ω) = (f″(ω)/2M²)(BW²/12)
Hence
η²(ω) = (M/N) IW + BW⁴ (f″(ω))² / (4M⁴ · 144 f²(ω)) = (M/N) IW + BW⁴ / (576 Bh(ω)⁴ M⁴)
where Bh (ω) is the spectral bandwidth defined by (3.13). Since the overall spectral
bandwidth Bh is the minimum of Bh (ω), we have
(4.2) max_ω η²(ω) = (M/N) IW + BW⁴ / (576 Bh⁴ M⁴)
Since M is at our disposal, let’s find the minimum of the right side in M . Differ-
entiating with respect to M , we get
IW/N − (1/144)(BW/Bh)⁴ M⁻⁵ = 0
which yields
(4.3) M = (BW/Bh)^{4/5} (N/(144 IW))^{1/5}
and, finally,
(4.4) max_ω η²(ω) ≈ 0.463 N^{−4/5} (BW IW/Bh)^{4/5}
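The optimization is easy to replay numerically: plug (4.3) into (4.2), check that nearby values of M do worse, and compare the minimum with the 0.463 coefficient of (4.4). A sketch in plain Python (the parameter values are ours; BW = 2π and IW = 1 correspond to the Daniell window):

```python
def mspe(M, N, IW, BW, Bh):
    # right side of (4.2) as a function of the window width M
    return M / N * IW + BW ** 4 / (576 * Bh ** 4 * M ** 4)

def optimal_M(N, IW, BW, Bh):
    # formula (4.3)
    return (BW / Bh) ** 0.8 * (N / (144 * IW)) ** 0.2

N, IW, BW, Bh = 1000, 1.0, 6.2832, 0.5
M = optimal_M(N, IW, BW, Bh)
best = mspe(M, N, IW, BW, Bh)
approx = 0.463 * N ** (-0.8) * (BW * IW / Bh) ** 0.8   # (4.4)
```

At the optimum the variance term is exactly four times the squared-bias term, which is the usual signature of minimizing a + bM against c/M⁴.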
The formula (4.4) gives an upper bound for the precision of the estimate. It depends
on the sample size N , on the spectral bandwidth Bh and on the spectral window.
The value BW IW characterizes the efficiency of the window (the smaller it is, the
better precision can be achieved for the same sample size N and the same spectral
bandwidth). Comparing the efficiency of the different windows, we get the following
table.
window               Bandwidth   variance × N/[M f²(ω)]   efficiency
Daniell              2π          1                        6.2832
Parzen               12          0.539285                 6.48
Tukey-Hanning        2.45π       3/4                      5.7715
Bartlett-Priestley   1.55π       6/5                      5.8403
However, the Tukey-Hanning window is not non-negative (it may lead to negative
estimates of the spectral density, as we can see on Figure 32). If we limit ourselves
by the non-negative windows, we see that the Bartlett-Priestley window has the
best efficiency. In fact, it could be shown theoretically that the Bartlett-Priestley
window is most efficient in a class of non-negative spectral windows.
Solving for the sample size N, we see that
N ≥ 0.463 η^{−5/2} BW IW/Bh
which gives us a lower bound for the sample size if we wish to achieve given precision
and are willing to handle spectral densities with the spectral bandwidth Bh or more.
Lag windows and leakage. In the early days, lag windows (truncated periodogram, triangular, Parzen etc.) were heavily used. Later, frequency windows
were suggested. Lag windows are easier to use: we need only the values of the
autocovariance function up to the lag M. In order to use a frequency window,
a periodogram is needed. As we have seen, lag windows can be transformed into
frequency windows and vice versa. However, there is an important difference. If the
lag window vanishes after some lag M, then the corresponding frequency window
does not have any truncation point, and vice versa. For instance, let's consider
the triangular window (3.9). The corresponding frequency window (3.10) is given by a
Figure 29. Fejer’s kernel that corresponds to the triangular lag window.
Fejer’s kernel (Figure 29). In addition to the primary peak at zero, Fejer’s window
has small secondary peaks.
The value of the estimate at a frequency ω is therefore a weighted average of
the values of the periodogram around ω with weights given by the Fejer’s kernel.
Suppose the spectrum (hence the periodogram) contains a narrow peak at the
frequency ω0 . As we move away from ω0 , the peak in the periodogram will move
from the main peak of the Fejer’s kernel to a secondary one (Figure 30), producing
a small but visible peak which does not exist in the original spectrum. Such an
effect is called a leakage. For a triangular window, the leakage is significant (Figure
31), but other lag windows (Parzen, Tukey) also have this effect, as you can see on
Figure 25 for Parzen window and on Figure 32 for Tukey window.
Differencing and pre-whitening. Suppose the actual spectral density of
the series has a heavy peak at zero. Since low frequencies correspond to long
periods, such things happen if the series is nearly non-stationary. In such a case,
a periodogram also has a sharp peak near zero. No matter which spectral window
we are using, we will get a not-so-sharp peak at zero (wide though not that high),
so that our estimate will be significantly biased in a neighborhood of zero. Here is
a trick which may help to fight off that bias. It is called differencing. Consider the
increments
Yt = Xt − Xt−1
[Figure panels: two spectral estimates on a logarithmic scale (Periods axes).]
According to (1.14),
fY(ω) = fX(ω)|1 − e^{−iω}|² = fX(ω)((1 − cos ω)² + sin²ω)
= fX(ω)(2 − 2 cos ω) = fX(ω) · 4 sin²(ω/2)
Since sin²(ω/2) has a second order zero at ω = 0, fY should not have any peak at zero.
Hence, fY should be easier to estimate. Finally, we set
f̂X(ω) = f̂Y(ω)/(4 sin²(ω/2))
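The algebra behind the trick is just the transfer-function identity |1 − e^{−iω}|² = 4 sin²(ω/2), which can be verified directly (a plain Python sketch; the function name is ours):

```python
import cmath, math

def diff_transfer(omega):
    # factor relating f_Y to f_X for the differenced series Y_t = X_t - X_{t-1}
    return abs(1 - cmath.exp(-1j * omega)) ** 2
```

The factor vanishes quadratically at ω = 0, which is exactly why differencing kills a peak at zero.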
Differencing could be considered as a special case of pre-whitening. According
to the method, we are looking for a transformation
Yt = α(B)Xt = Xt + a1 Xt−1 + · · · + ak Xt−k
such that the spectrum of Yt is nearly flat (so, Yt is, nearly, a white noise). The
spectrum of Yt should be easy to estimate (since it is flat, there is little or no bias,
we can use small M and therefore have a small variance). However, according to
or (sometimes) as
F*X(ω) = ∫_{−π}^{ω} fX(θ) dθ
In both cases, we get a non-negative increasing function such that F(π) = σX².
Dividing by σX², we are getting a normalized integrated spectrum
H(ω) = (2/σX²) ∫_{0}^{ω} fX(θ) dθ
For a white noise, H(ω) = ω/π.
[Figure panels: a spectrum on a logarithmic scale and a periodogram (Periods axes).]
In fact, integrated spectrum can be defined for processes with discrete spectra
as well, like in the example at the very beginning of Section 4.1; if the spectrum
is continuous, then the integrated spectrum is differentiable. For the discrete spec-
trum, the corresponding integrated spectrum is a step function.
A natural estimate for the (normalized) integrated spectrum is the integrated
periodogram
Ĥ(ω) = Σ_{ωp ≤ ω} I(ωp) / Σ_p I(ωp)
(Note that Σ_p I(ωp) = σ̂² is actually an estimate for the variance of the series.)
It could be shown that, under mild assumptions, the integrated periodogram
is an unbiased and consistent estimate for the integrated spectrum.
Integrated periodogram for a white noise. Suppose Xt is a white noise.
Consider
γ = max_ω √(N/2) |Ĥ(ω) − ω/π|
Note that
Ĥ(ω) − ω/π = (1/σ̂X²) Σ_{ωp ≤ ω} (I(ωp) − 1)
and the random variables I(ωp) − 1 are independent identically distributed with zero
mean. Hence Ĥ(ω) − ω/π behaves like sums of independent identically distributed
random variables conditioned to get to zero at p = N/2 (which corresponds to π).
Using the advanced tools of stochastic processes (the so-called Brownian bridge), we
can compute the distribution of γ. Namely,
P{max_p √(N/2) |Ĥ(ωp) − ωp/π| ≤ a} ≈ Δ[2](a) = Σ_{j=−∞}^{∞} (−1)^j e^{−2a²j²}
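The series for Δ[2] converges extremely fast, so evaluating it is easy even if deriving it was not; the same series appears as the limiting distribution of the Kolmogorov-Smirnov statistic. A sketch in plain Python (the function name is ours):

```python
import math

def delta2(a, terms=50):
    # Delta_[2](a) = sum_{j=-inf}^{inf} (-1)^j exp(-2 a^2 j^2);
    # the j and -j terms coincide, hence the factor of 2
    s = 1.0
    for j in range(1, terms):
        s += 2 * (-1) ** j * math.exp(-2 * a * a * j * j)
    return s
```

For moderate a only a handful of terms contribute; the familiar 95% point sits near a ≈ 1.36.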
The function ∆[2] is not easy to compute. However, it had to be done only once,
and somebody did that for us. In particular,
This gives us the following test of whether the series is actually a white noise.
We compute γ and compare it with the desired percentile of Δ[2]. If γ exceeds the
percentile, we reject the white noise hypothesis. This test works well in situations
when the spectral density does not have significant peaks, but the total weight of,
say, low frequencies is significantly more than it is supposed to be.
Fast Fourier transform. A straightforward computation of the periodogram
requires about N² multiplications: for each particular frequency, we need 2N
operations in order to compute Σ Xt e^{iωt}, and we have N/2 principal frequencies.
However, suppose that N = rs can be factored. Every t = 0, . . . , N − 1 can be
uniquely represented as t = rt1 +t0 where 0 ≤ t1 ≤ s−1, 0 ≤ t0 ≤ r −1 (actually, t1
is the biggest integer such that t1 ≤ t/r). In a similar way, every p = 0, . . . , N − 1
can be uniquely represented as p = sp1 + p0 where 0 ≤ p1 ≤ r − 1, 0 ≤ p0 ≤ s − 1.
Now,
d(ωp) = Σ_{t=1}^{N} Xt e^{iωp t} = e^{iωp} Σ_{t=0}^{N−1} Xt+1 e^{iωp t}
= e^{iωp} Σ_{t0} Σ_{t1} X_{rt1+t0+1} e^{2πip(rt1+t0)/N}
= e^{iωp} Σ_{t0} e^{2πipt0/N} Σ_{t1} X_{rt1+t0+1} e^{2πiprt1/N}
Note that
e^{2πiprt1/N} = e^{2πip1srt1/N} e^{2πip0rt1/N} = e^{2πip0rt1/N}
since rs/N = 1 and e^{2πik} = 1 for an integer k. Therefore, denoting
a(p0, t0) = Σ_{t1} X_{rt1+t0+1} e^{2πip0rt1/N}
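The factored sum can be checked against the direct one; a sketch in plain Python (function names are ours; the small length N = 6 = 2 · 3 is chosen for a clean test):

```python
import cmath, math

def dft_direct(x, p):
    # d(omega_p) = sum_{t=1}^{N} X_t e^{i omega_p t}, omega_p = 2 pi p / N
    n = len(x)
    w = 2 * math.pi * p / n
    return sum(x[t] * cmath.exp(1j * w * (t + 1)) for t in range(n))

def dft_factored(x, p, r, s):
    # same sum via t = r*t1 + t0 and p = s*p1 + p0, with N = r*s;
    # the inner sums over t1 depend on p only through p0 = p mod s
    n = len(x)
    assert n == r * s
    p0 = p % s
    w = 2 * math.pi / n
    total = 0.0
    for t0 in range(r):
        inner = sum(x[r * t1 + t0] * cmath.exp(1j * w * p0 * r * t1)
                    for t1 in range(s))
        total += cmath.exp(1j * w * p * t0) * inner
    return cmath.exp(1j * w * p) * total
```

The saving comes from the fact that only s distinct inner sums per t0 are ever needed, which is the first step of the recursive fast Fourier transform.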
Project.
(a) Evaluate (numerically) the spectral bandwidth for each of the following
processes
Xt − 1.8Xt−1 + 0.9Xt−2 = εt
Xt − 3Xt−1 + 4.02Xt−2 − 2.808Xt−3 + 0.864Xt−4 = εt
(1 + 0.028B + 1.06B 2 − 0.11B 3 + 0.21B 4 )Xt = (1 + 0.24B + 0.99B 2 )εt
(b) We expect the spectral bandwidth of the process to be about 0.03. We would
like to achieve maxω η 2 (ω) ≤ 0.1 where η 2 (ω) is the mean square percentage error
(4.1). Assuming we are using the Bartlett window, what should be the smallest
sample size? What is the optimal window width?
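Part (a) can be attacked with a crude grid search for the half-power points. A sketch in plain Python for the first of the three processes (the function names and the grid size are ours):

```python
import math

def ar2_spec(omega):
    # spectrum of X_t - 1.8 X_{t-1} + 0.9 X_{t-2} = eps_t (sigma_eps^2 = 1);
    # |alpha(e^{i omega})|^2 = 5.05 - 6.84 cos(omega) + 1.8 cos(2 omega)
    return 1 / (2 * math.pi * (5.05 - 6.84 * math.cos(omega)
                               + 1.8 * math.cos(2 * omega)))

def half_power_bandwidth(f, grid=20000):
    # distance between the outermost frequencies where f >= (peak value)/2,
    # together with the location of the peak on the grid
    ws = [math.pi * i / grid for i in range(1, grid + 1)]
    vals = [f(w) for w in ws]
    peak = max(vals)
    above = [w for w, v in zip(ws, vals) if v >= peak / 2]
    return above[-1] - above[0], ws[vals.index(peak)]
```

For this process the complex roots have r ≈ 0.949 and θ ≈ 0.322, so the peak sits near ω ≈ 0.318 by (1.19), and the bandwidth comes out close to 0.11.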
CHAPTER 5
Filters
1. Classification of Filters
Suppose two series Xt and Yt are related to each other by the formula
(1.1) Yt = Σ_l bl Xt−l,
or, more generally, by the formula
(1.2) Σ_k ak Yt−k = Σ_l bl Xt−l
(all sums should contain a finite number of non-zero terms). Treating Xt as an
input and Yt as an output, we call the relation (1.1) (or (1.2)) a filter.
In terms of the back shift operator B, we can rewrite (1.1) as Yt = β(B)Xt and
(1.2) as α(B)Yt = β(B)Xt, where α(z) = Σ_k ak z^k and β(z) = Σ_l bl z^l.
From the practical point of view, (1.1) gives us the output right away whereas
(1.2) is an equation which has to be solved for Yt . Filters of the second type
are called recursive. For the recursive filters, the coefficients ak should vanish for
negative k (the same condition applies to bl though it is less important). Even
more, the polynomial α(z) should satisfy the stationarity condition: all roots of
α(z), complex or real, should be greater than one in absolute value.
Suppose Xt and Yt are stationary processes related by (1.1). Denote β(x) =
Σ_l bl x^l. According to (IV.1.14), their spectral densities are related to each other
by the formula
fY(ω) = |β(e^{iω})|² fX(ω)
The factor
(1.3) T(ω) = |β(e^{iω})|²
is called the transfer function of the filter. If the filter is recursive, that is, if it is
given by (1.2) rather than (1.1), then
fY(ω) = |β(e^{iω})/α(e^{iω})|² fX(ω),
and the transfer function is equal to
(1.4) T(ω) = |β(e^{iω})/α(e^{iω})|²
A rather typical interpretation of the setup (1.1) and (1.2) is as follows. Suppose
Xt = St + Nt where St is a signal and Nt is a noise. Assume that the noise and
the signal are independent. Also, let us treat St and Nt as stationary processes
with spectral densities fS (ω) and fN (ω). Since S and N are independent, the
autocovariance function RX (k) = Cov(St + Nt , St+k + Nt+k ) = Cov(St , St+k ) +
[Figure panels: two transfer functions plotted against frequency, 0 to π.]
Computation reveals
T (ω) = 4(1 − cos ω)2 .
The graph of this function is shown in Figure 2. As we can see, the filter (1.6)
suppresses low frequencies and amplifies high frequencies. This function does not
approximate the transfer function of any of the ideal filters, so it is not a filter of
any of the above types.
Moving Average as a filter. We have seen the relation of the type (1.1)
when we have discussed the method of moving averages in Section I.2. According
to (I.2.1), an estimate fˆ(t) for the trend was constructed as
f̂(t) = Yt = (1/(2l+1)) Σ_{k=−l}^{l} Xt−k
We can consider this relation as a filter (1.1). Let us find its transfer function.
The corresponding function β(x) equals
β(x) = (1/(2l + 1)) Σ_{k=−l}^{l} x^k = (1/(2l + 1)) x^{−l}(1 + x + x² + · · · + x^{2l})
= (1/(2l + 1)) x^{−l} (1 − x^{2l+1})/(1 − x).
So,
(1.7) T(ω) = (1/(2l + 1)²) |e^{−ilω}|² |1 − e^{(2l+1)iω}|²/|1 − e^{iω}|²
Now, for a real number z, |eiz | = 1 and
|1 − eiz |2 = (1 − cos z)2 + (− sin z)2 = 2 − 2 cos z.
For these reasons, (1.7) boils down to
(1.8) T(ω) = (1/(2l + 1)²) (1 − cos((2l + 1)ω))/(1 − cos ω).
L'Hôpital's rule implies that T(0) = 1. On the other hand, T(π) = 1/(2l + 1)² is close
to zero for large l. So, moving averages suppress high frequencies, and we can use
them as an approximation for the low pass filter. However, it is not quite clear
what we should call the cut-off point for this filter. For instance, we may notice
that T(2π/(2l + 1)) = 0 and that this is the smallest positive zero of the function. So,
ω0 = π/(2l + 1) could be considered as a cut-off point. The value of the transfer
function at ω0 is approximately equal to 4/π². However, we have very little control
over the cut-off point, and we have no means to make the transfer function
approximate the “ideal” 0-1 transfer function of a low pass filter. We should
look for better options. Namely, we definitely need full control of the cut-off point.
Also, we should be able to approximate the transfer function of the ideal filter as
closely as we want, as shown on Figure 5.
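The behavior of (1.8) is easy to examine numerically; a minimal sketch (the helper name `ma_transfer` is mine, not from the text):

```python
import numpy as np

def ma_transfer(omega, l):
    """Transfer function (1.8) of the (2l+1)-point moving average."""
    omega = np.asarray(omega, dtype=float)
    num = 1 - np.cos((2 * l + 1) * omega)
    den = (2 * l + 1) ** 2 * (1 - np.cos(omega))
    # At omega = 0 the quotient is 0/0; by L'Hopital's rule T(0) = 1
    return np.where(np.abs(omega) < 1e-12, 1.0,
                    num / np.where(den == 0, 1.0, den))

l = 5
print(ma_transfer(0.0, l))        # T(0) = 1
print(ma_transfer(np.pi, l))      # T(pi) = 1/(2l+1)^2 = 1/121
```

Graphing T on [0, π] shows the small side lobes between the zeros of (1.8), which is exactly why the cut-off point of the moving average is hard to control.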
First order low pass sine filter. Suppose
(1.9) Yt − aYt−1 = (1 − a)Xt
where a > 0 is a parameter. Since (1.9) defines a recursive filter, the stationarity
condition implies that a < 1. According to (1.4),
fY(ω) = |(1 − a)/(1 − a cos ω − ia sin ω)|² fX(ω) = ((1 − a)²/(1 + a² − 2a cos ω)) fX(ω)
and therefore
T(ω) = (1 − a)²/(1 + a² − 2a cos ω).
Note that 1 + a² − 2a cos ω ≥ 1 + a² − 2a = (1 − a)² and therefore T(ω) ≤ 1. Since
1 + a² − 2a cos ω = (1 − a)² + 4a sin²(ω/2), the transfer function can be rewritten as
T(ω) = 1/(1 + (sin(ω/2)/sin(ω0/2))²)
where the cut-off point ω0 is determined by
(1.11) sin²(ω0/2) = (1 − a)²/(4a)
(this equation has a solution if a > 3 − 2√2). More generally, a sine filter of the
order n is a filter with the transfer function
T(ω) = 1/(1 + (sin(ω/2)/sin(ω0/2))^{2n}).
As we can see, the function T is decreasing, T (0) = 1 and T (ω0 ) = 1/2. Let us find
out what happens to T (ω) if we increase n, say, send it to infinity. If 0 < ω < ω0 ,
then sin(ω/2) < sin(ω0/2) and therefore
(sin(ω/2)/sin(ω0/2))^{2n} → 0
as n → ∞. If, however, ω0 < ω < π, then sin(ω/2) > sin(ω0/2) and therefore
(sin(ω/2)/sin(ω0/2))^{2n} → ∞.
Therefore T (ω) → 1 if ω < ω0 and T (ω) → 0 if ω > ω0 , that is, T (ω) approximates
the transfer function of the ideal low pass filter with cut-off point ω0 . The quality
of approximation depends on the order of the filter. If we graph transfer functions
of sine filters of different orders with the same ω0 , we get a picture similar to Figure
5. For the above reasons, the point ω0 is called the cut-off point of the sine filter.
Solving (1.11) for a, we can construct a first order filter with given cut-off point ω0 .
However, designing an n-th order filter is another story.
If we begin with the equation
(1.14) Yt + aYt−1 = (1 − a)Xt
instead of (1.9), we get a high pass filter with transfer function
T(ω) = 1/(1 + (4a/(1 − a)²) cos²(ω/2)),
the so-called Butterworth cosine filter of the order 1. Its cut-off point can be found
from the equation cos²(ω0/2) = (1 − a)²/(4a), which again has a solution if a > 3 − 2√2.
Instead of sine and cosine filters, we will focus on the so-called Butterworth tangent
filters, which are much easier to design. A tangent low pass filter of the order n is
a filter with the transfer function
(1.15) T(ω) = 1/(1 + (tan(ω/2)/tan(ω0/2))^{2n})
Once again, we can see that T (ω) is decreasing, T (0) = 1 and T (ω0 ) = 1/2. Also,
if n → ∞, then T (ω) → 1 if ω < ω0 and T (ω) → 0 if ω > ω0 , same as for the sine
filters.
There is one important difference between sine/cosine filters and tangent filters.
Sine filter of the order n has the structure
α(B)Yt = Xt
where the polynomial α(B) has the degree n. Tangent filter of the order n has the
structure
α(B)Yt = β(B)Xt
where both polynomials α(B) and β(B) have the degree n.
At low frequencies, transfer functions of the sine filter and tangent filter practically
coincide. However, there is a difference at high frequencies. For a sine filter,
T(π) = 1/(1 + (1/sin(ω0/2))^{2n}),
though for a tangent filter T (π) = 0. So, tangent filters do a better job eliminating
undesired highly oscillating components. On Figure 6, you can see transfer functions
for the sine filter and for the tangent filter, both of them of the order 5, with
the same cut-off point π/10. The difference between them can be clearly seen in
logarithmic scale (Figure 7).
Designing filters with given transfer function will be discussed in the next sec-
tion.
Exercises
For the following filters, find their transfer functions and graph them. Decide
whether the filter could be called a low pass filter, a high pass filter, a band pass
filter, or a band reject filter. If so, what, approximately, is the cut-off point (or points)?
1. Yt = 0.25Xt−1 + 0.5Xt + 0.25Xt+1 .
2. Yt − 0.5Yt−1 = Xt + Xt−1 .
3. Yt = 0.5(Xt − Xt−1 ).
4. Yt = 0.25Xt − 0.5Xt−1 + 0.25Xt−2 .
5. Yt − 0.727Yt−1 = 0.137(Xt + Xt−1 )
6. Yt = 0.5(Xt + Xt−1 )
7. Yt − 1.561Yt−1 + 0.641Yt−2 = 0.02(Xt + 2Xt−1 + Xt−2 )
8. 3.414Yt + 0.586Yt−2 = Xt + 2Xt−1 + Xt−2
9. Yt + 0.414Yt−1 = 0.293(Xt − Xt−1 )
10. Yt + 0.577Yt−1 = 0.211(Xt − Xt−1 )
11. Yt + 0.943Yt−1 + 0.333Yt−2 = 0.098(Xt − 2Xt−1 + Xt−2 )
12. Yt + 1.28Yt−1 + 0.478Yt−2 = 0.0495(Xt − 2Xt−1 + Xt−2 )
13. 2Yt − 1.414Yt−1 = Xt − Xt−2
14. 2Yt − 1.414Yt−1 = Xt − 1.414Xt−1 + Xt−2
15. Yt − 0.607Yt−1 + 0.51Yt−2 = 0.245(Xt − Xt−2 )
16. Yt − 0.607Yt−1 + 0.51Yt−2 = 0.755(Xt − 0.805Xt−1 + Xt−2 )
17. Yt + 1.051Yt−1 + 0.649Yt−2 = 0.175(Xt − Xt−2 )
18. Yt + 1.051Yt−1 + 0.649Yt−2 = 0.825Xt + 1.051Xt−1 + 0.825Xt−2
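For these exercises, T(ω) can be evaluated directly from (1.4) and the filter coefficients; a minimal sketch (the helper name and the coefficient convention are mine):

```python
import numpy as np
from numpy.polynomial import polynomial as P

def transfer(a, b, omega):
    """T(omega) = |beta(e^{i omega})|^2 / |alpha(e^{i omega})|^2 for the filter
    a[0] Y_t + a[1] Y_{t-1} + ... = b[0] X_t + b[1] X_{t-1} + ...
    For a non-recursive filter, pass a = [1].  Terms like X_{t+1} can be
    re-indexed: a time shift multiplies beta by a factor of modulus one,
    so T is unchanged."""
    z = np.exp(1j * np.asarray(omega, dtype=float))
    return np.abs(P.polyval(z, b)) ** 2 / np.abs(P.polyval(z, a)) ** 2

# Exercise 3: Y_t = 0.5 (X_t - X_{t-1})
w = np.linspace(0, np.pi, 500)
T = transfer([1.0], [0.5, -0.5], w)
print(T[0], T[-1])   # 0 at omega = 0, 1 at omega = pi: a high pass filter
```

Plotting T against w and reading off where T crosses 1/2 gives an approximate cut-off point for each exercise.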
α(B)Yt = β(B)Xt
So, the roots of the polynomials define the filter up to unknown quotient an /bm (a
proportional change of all of the coefficients does not change the filter). However,
the maximal value of the transfer function should be equal to one. Depending on
the type of the filter, this leads to a normalization condition which allows us to find
the value |an /bm |. For instance, for the low pass filters, as well as for band reject
filters, T (0) = 1, which is equivalent to the property |α(1)| = |β(1)|. (In fact, it is
quite natural to assume that, if the input is a constant, then the output must be
the same constant, which means α(1) = β(1), without absolute values). For the
high-pass filters, T (π) = 1, which is equivalent to |α(−1)| = |β(−1)|. For the band
pass filters, we actually have to find a point where T (ω) reaches its maximum, and
set that value to one.
Denote G(z) = β(z)/α(z). Let us consider T(ω) = T̃(e^{iω}) as a function on the unit
circle. On the unit circle,
T̃(e^{iω}) = |G(e^{iω})|² = G(e^{iω})G(e^{−iω}) = G(e^{iω})G(1/e^{iω}).
Hence the function T̃(z) coincides with the analytic function G(z)G(1/z), which is
therefore an analytic continuation of T̃(z). By the properties of analytic functions,
an analytic continuation is unique (if it exists at all).
Now, α(z) must satisfy the stationarity condition, so its roots z1, . . . , zn
must lie outside of the unit circle. In order to find the roots of α(z), let us find the
poles of T̃(z), that is, the roots of 1/T̃(z). Clearly, 1/T̃(z) = (1/G(z)) (1/G(1/z)) = 0
if and only if α(z) = 0 or α(1/z) = 0. Hence T̃(z) has 2n poles z1, . . . , zn and
1/z1, . . . , 1/zn, of which the first n are outside of the unit circle and the other n are
inside (so we can easily decide which roots are to be used).
Let us now find the zeroes of T̃(z) = G(z)G(1/z). Clearly, G(z)G(1/z) = 0 if and
only if β(z) = 0 or β(1/z) = 0. So T̃(z) has 2m zeros, which come in pairs
w1, 1/w1, . . . , wm, 1/wm. Out of each pair, only one root is to be used. It looks like
we have some flexibility here. However, in most cases, wi = ±1.
Hence, in order to implement a filter with a given transfer function T (ω), we
should do the following.
Step 1. Consider T(ω) as a function on the unit circle and find its analytic
continuation T̃(z) (it is unique if it exists at all).
Step 2. Find all the roots and all the poles of T̃ (z). Choose those poles that
are outside the unit circle. Select half of the zeroes to be used.
Step 3. Find the quotient an /bm from the normalization condition.
Low pass Tangent Filters. Practical realization of this program for the
Butterworth low pass tangent filters is not so difficult. We begin with the formula
tan(ω/2) = (1/i) (e^{iω/2} − e^{−iω/2})/(e^{iω/2} + e^{−iω/2}) = (1/i) (e^{iω} − 1)/(e^{iω} + 1) = i (1 − e^{iω})/(1 + e^{iω})
Denote A = tan(ω0/2). Replacing e^{iω} by z, we easily get an analytic continuation
for T̃(z):
(2.1) T̃(z) = A^{2n}/(A^{2n} + (−1)^n ((1 − z)/(1 + z))^{2n}).
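The design procedure can be carried out numerically straight from (2.1): the poles of T̃(z) are the roots of the polynomial A^{2n}(1 + z)^{2n} + (−1)^n(1 − z)^{2n}, and we keep those outside the unit circle. A sketch (the function name and the coefficient conventions are mine, not from the text):

```python
import numpy as np
from numpy.polynomial import polynomial as P

def lowpass_tangent(n, omega0):
    """n-th order Butterworth low pass tangent filter with cut-off point omega0.
    Returns (a, b), constant term first, for the filter
    a[0] Y_t + a[1] Y_{t-1} + ... = b[0] X_t + b[1] X_{t-1} + ..."""
    A = np.tan(omega0 / 2)
    # Poles of T~(z): roots of A^{2n} (1+z)^{2n} + (-1)^n (1-z)^{2n}
    p = A ** (2 * n) * P.polypow([1, 1], 2 * n) + (-1) ** n * P.polypow([1, -1], 2 * n)
    roots = P.polyroots(p)
    alpha = P.polyfromroots(roots[np.abs(roots) > 1]).real  # stationarity condition
    beta = P.polypow([1, 1], n).astype(float)               # beta(z) proportional to (1+z)^n
    beta *= P.polyval(1, alpha) / P.polyval(1, beta)        # normalization alpha(1) = beta(1)
    return alpha, beta

a, b = lowpass_tangent(1, np.pi / 6)
print(np.round(a / a[0], 3), np.round(b / a[0], 3))
# reproduces Y_t - 0.577 Y_{t-1} = 0.211 (X_t + X_{t-1})
```

For n = 2 and ω0 = π/3 the same function reproduces the coefficients of (2.3), up to a common factor.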
First order low pass tangent filter. We have n = 1 and the equation (2.1)
implies
((1 − z)/(1 + z))² = A².
Assume for now that ω0 ≠ π/2, so that A ≠ 1. We have
(1 − z)/(1 + z) = ±A,   z = (1 ∓ A)/(1 ± A).
Now, A = tan(ω0/2) is positive and therefore the root z1 = (1 + A)/(1 − A) is outside the
unit circle. Hence
α(z) = a1 (z − z1 ), β(z) = b1 (1 + z)
Finally, the normalization condition yields
a1(1 − z1) = 2b1,   b1/a1 = (1 − z1)/2 = A/(A − 1).
Setting a1 = 1, we get b1 = A/(A − 1), so the filter should be
−((1 + A)/(1 − A)) Yt + Yt−1 = (A/(A − 1)) (Xt + Xt−1)
or
(2.2) (A + 1)Yt + (A − 1)Yt−1 = A(Xt + Xt−1)
This formula still works if ω0 = π/2 and A = 1, though the filter becomes non-
recursive.
Conversely, suppose the filter is given by (2.2). Then
|α(eiω )|2 = |(A + 1) + (A − 1) cos ω + i(A − 1) sin ω|2
= (A + 1)2 + 2(A + 1)(A − 1) cos ω + (A − 1)2 cos2 ω + (A − 1)2 sin2 ω
= 2A2 + 2 + 2(A2 − 1) cos ω = 2(A2 (1 + cos ω) + (1 − cos ω))
In a similar way,
|β(eiω )|2 = |A((1 + cos ω) + i sin ω)|2
= A2 (1 + 2 cos ω + cos2 ω + sin2 ω) = 2A2 (1 + cos ω).
Hence the transfer function
T(ω) = A²(1 + cos ω)/(A²(1 + cos ω) + (1 − cos ω))
= 1/(1 + (1/A²) (1 − cos ω)/(1 + cos ω)) = 1/(1 + tan²(ω/2)/A²).
Example. Let ω0 = π/6. Then A = tan(π/12) ≈ 0.268 and (2.2) becomes
1.268Yt − 0.732Yt−1 = 0.268(Xt + Xt−1).
If we wish, we can divide all coefficients by the first one, and get the equation
Yt − 0.577Yt−1 = 0.211(Xt + Xt−1 ).
Second order low pass tangent filter. The equation (2.1) becomes
((1 − z)/(1 + z))⁴ = −A⁴
and therefore
(1 − z)/(1 + z) = A(±√2/2 ± i √2/2)
(all four combinations of signs are possible). We need to choose two of them. As
we know, if w = (1 − z)/(1 + z), then z = (1 − w)/(1 + w), and |z| > 1 if and only if
|w − 1| > |w + 1|, that is, if and only if w belongs to the left half plane. With this
in mind, we can find
z1,2 = (1 − A² ± √2 A i)/(1 − √2 A + A²)
Note that
(1 − √2 A + A²)(1 + √2 A + A²) = 1 + A⁴.
Therefore
α(z) = a2(z − z1)(z − z2) = a2(z² − (z1 + z2)z + z1z2)
= a2(z² − (2(1 − A²)/(1 − √2 A + A²)) z + (1 + A⁴)/(1 − √2 A + A²)²)
= a2(z² − (2(1 − A²)/(1 − √2 A + A²)) z + (1 + √2 A + A²)/(1 − √2 A + A²))
Next,
β(z) = b2 (z + 1)2 = b2 (z 2 + 2z + 1)
and we can find the quotient a2 /b2 from the normalization condition α(1) = β(1).
We have β(1) = 4b2 and
α(1) = a2(1 − 2(1 − A²)/(1 − √2 A + A²) + (1 + A⁴)/(1 − √2 A + A²)²)
= a2 ((1 − √2 A + A²)² − 2(1 − A²)(1 − √2 A + A²) + 1 + A⁴)/(1 − √2 A + A²)²
= a2 (4A² − 4√2 A³ + 4A⁴)/(1 − √2 A + A²)² = 4a2 A²/(1 − √2 A + A²)
After all cancelations, the normalization condition α(1) = β(1) reads
A² a2 = (1 − √2 A + A²) b2.
Choosing a2 = 1 − √2 A + A² and b2 = A², we end up with the following expression:
(2.3) (1 + √2 A + A²)Yt − 2(1 − A²)Yt−1 + (1 − √2 A + A²)Yt−2
= A²(Xt + 2Xt−1 + Xt−2).
Example. Let ω0 = π/3. Then A = tan(π/6) = 1/√3 ≈ 0.577 and (2.3) becomes
(4/3 + √(2/3)) Yt − (4/3) Yt−1 + (4/3 − √(2/3)) Yt−2 = (1/3)(Xt + 2Xt−1 + Xt−2),
or, approximately,
2.15Yt − 1.333Yt−1 + 0.517Yt−2 = 0.333(Xt + 2Xt−1 + Xt−2 ).
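A quick numerical check of this example from formula (2.3) (the variable names are mine):

```python
import numpy as np

omega0 = np.pi / 3
A = np.tan(omega0 / 2)                 # A = tan(pi/6) = 1/sqrt(3)
r = np.sqrt(2)
y = [1 + r * A + A**2, -2 * (1 - A**2), 1 - r * A + A**2]  # Y_t, Y_{t-1}, Y_{t-2}
x = A**2                               # right side: A^2 (X_t + 2 X_{t-1} + X_{t-2})
print(np.round(y, 3), round(x, 3))     # 2.15, -1.333, 0.517 and 0.333
```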
Role of the order of the filter. The order of the filter determines the shape
of the transfer function, and therefore it determines the quality of the filtration. To
illustrate the difference between filters of different orders, we apply them to the
(simulated) data (Figure 8) of the form Xt = St + εt, where the signal St = 1 if t
belongs to one of the intervals [101, 200], [280, 300] or [380, 400] and is equal to
zero otherwise, and the noise εt is a white noise with expectation zero and variance σ² = 1.
For instance, we can think of the signal as a dash-dot-dot signal, which corresponds
Figure 9. Low pass tangent filter of the order 2, with the same
cut-off point π/30. The curve looks just a bit smoother than
the sine filter results.
Figure 10. Low pass tangent filter of the order 4, with the same
cut-off point π/30.
to the character “d” in Morse code. We can treat it as a low frequency
signal, and apply low pass filters to the data. On Figures 9-11, we can see tangent
filters of the orders 2, 4 and 6 with cut-off point ω0 = π/30 in action. Figure 12
explains how the cut-off point was chosen.
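The experiment is easy to reproduce. A sketch, assuming a series of length 500 and zero initial conditions for the recursion (neither is stated in the text); for brevity it applies the first order filter (2.2) rather than the higher order filters shown in the figures:

```python
import numpy as np

def apply_filter(a, b, x):
    """Run a[0] y[t] + a[1] y[t-1] + ... = b[0] x[t] + b[1] x[t-1] + ...
    forward in time, with zero initial conditions."""
    y = np.zeros(len(x))
    for t in range(len(x)):
        acc = sum(b[k] * x[t - k] for k in range(len(b)) if t - k >= 0)
        acc -= sum(a[k] * y[t - k] for k in range(1, len(a)) if t - k >= 0)
        y[t] = acc / a[0]
    return y

rng = np.random.default_rng(0)
t = np.arange(1, 501)
s = (((t >= 101) & (t <= 200)) | ((t >= 280) & (t <= 300))
     | ((t >= 380) & (t <= 400))).astype(float)
x = s + rng.normal(0.0, 1.0, t.size)          # X_t = S_t + eps_t

A = np.tan(np.pi / 60)                        # first order tangent filter, omega0 = pi/30
y = apply_filter([A + 1, A - 1], [A, A], x)   # filter (2.2)
```

On the signal intervals y settles near 1, and elsewhere near 0, though a first order filter leaves visibly more residual noise than the order 2-6 filters of Figures 9-11.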
Figure 11. Low pass tangent filter of the order 6, with the same
cut-off point π/30.
Figure 12. How to choose the cut-off point? Why did we choose
π/30? We looked at the periodogram, decided on the band
that contains the signal (a low frequency signal in this example; the
band is (0, 0.07) or so) and then chose the cut-off point with some
room to spare. Don't make the band too narrow; it is better to let a
bit of noise in.
High pass tangent filters. In a similar way, we could construct other types
of filters. A high pass tangent filter of the order n with the cut-off frequency ω0 is
a filter with the transfer function
T(ω) = 1/(1 + (cot(ω/2)/cot(ω0/2))^{2n})
(similar to the low pass filter, we have T (π) = 1, T (0) = 0 and T (ω0 ) = 1/2).
Construction of the high pass filter begins with the identity
cot(ω/2) = i (e^{iω} + 1)/(e^{iω} − 1).
Denote C = cot(ω0/2). Replacing e^{iω} by z, we get the following analytic continuation
for T̃(z):
T̃(z) = C^{2n}/(C^{2n} + (−1)^n ((z + 1)/(z − 1))^{2n})
and we can find a2/b2 from the normalization condition α(−1) = β(−1). After all
transformations, we get the same equation
C² a2 = (1 − √2 C + C²) b2
as in the case of the second order low pass filter (no surprise; compare the expressions
for α(z) and β(z) in both cases). Choosing a2 = 1 − √2 C + C² and b2 = C²,
we end up with the following expression:
(2.6) (1 + √2 C + C²)Yt − 2(C² − 1)Yt−1 + (1 − √2 C + C²)Yt−2
= C²(Xt − 2Xt−1 + Xt−2)
Example. Let ω0 = 5π/6. Then C = cot(5π/12) ≈ 0.268 and (2.6) becomes
1.451Yt + 1.856Yt−1 + 0.692Yt−2 = 0.0718(Xt − 2Xt−1 + Xt−2)
(note that −2(C² − 1) > 0 here, since C < 1).
Band pass and band reject filters have two cut-off points. Traditionally,
they are parameterized by the center of the band ωc and the bandwidth 2B (so the
cut-off points are ωc ± B). Band pass filter of the order n has the transfer function
(2.7) T(ω) = 1/(1 + ((cos(ω) − cos(ωc)) sec(B)/(tan(B) sin(ω)))^n)
Second order band pass tangent filter. Let the transfer function of the
filter be given by (2.7) with n = 2:
(2.9) T(ω) = 1/(1 + ((cos(ω) − cos(ωc)) sec(B)/(tan(B) sin(ω)))²)
It is natural to assume that 0 < ωc − B and ωc + B < π (the band is strictly inside
the interval [0, π]). In addition, we assume that B ≠ π/4.
Denote
D = cos(ωmax) = cos(ωc) sec(B) = cos(ωc)/cos(B),   E = tan(B).
Clearly, 0 < B < ωc < π − B and therefore cos(B) > cos(ωc) > cos(π − B) =
− cos(B). For this reason, |D| < 1. Also, E ≠ 1 by assumption. We have
T(ω) = 1/(1 + ((cos(ω) − D)/(E sin(ω)))²)
2. Find all the roots and all the poles of T̃ (z). The function would have four
roots w1 , 1/w1 , w2 , 1/w2 and four poles z1 , 1/z1 , z2 , 1/z2 . Take those poles that are
outside of the unit circle (we denote them by z1 and z2 ). Select half of the zeroes
to be used (one out of each pair; denote them by w1 , w2 ). Get
α(z) = a2 (z − z1 )(z − z2 ), β(z) = b2 (z − w1 )(z − w2 )
3. Find the quotient a2 /b2 from the normalization condition (maximal value of
T (ω) should be equal to 1).
To begin with, recall that
cos(ω) = (e^{iω} + e^{−iω})/2,   sin(ω) = (e^{iω} − e^{−iω})/(2i).
Replacing e^{iω} by z and e^{−iω} by 1/z, we get the following formula for T̃(z):
(2.10) T̃(z) = 1/(1 + (i (z + 1/z − 2D)/(E(z − 1/z)))²) = 1/(1 − (z² − 2Dz + 1)²/(E²(z² − 1)²))
Let us find the roots and the poles of the function T̃(z).
Clearly, T̃(z) = 0 if and only if (z² − 1)² = 0, that is, if z = ±1. Each of those
roots has multiplicity 2. According to the procedure, we should take one of each of
them. So, w1 = 1, w2 = −1 and therefore
β(z) = b2(z² − 1)
Now, we have the following equation for the poles:
(z² − 2Dz + 1)²/(E²(z² − 1)²) = 1
which is equivalent to the equation
z² − 2Dz + 1 = ±E(z² − 1).
So, the poles of the function T̃(z) could be found from the following two quadratic
equations:
(2.11) (1 + E)z² − 2Dz + (1 − E) = 0
or
(2.12) (1 − E)z² − 2Dz + (1 + E) = 0.
The equations have real roots if D² + E² ≥ 1; otherwise all the roots are complex.
But which roots are outside of the unit circle?
First of all, note that if z satisfies (2.11), then 1/z satisfies (2.12).
Suppose that D² + E² < 1, so the roots are not real. Then, for each of
the equations, the roots are conjugate to each other, they have the same absolute
value, so their product is equal to the square of their absolute value. However, the
product of the roots can be easily found from the coefficients. Hence, the square of
the absolute value of the roots is equal to (1 − E)/(1 + E) for (2.11), and it is equal
to (1 + E)/(1 − E) for (2.12). Since 0 < E < 1 in this case, (2.12) is the equation
with the roots outside of the unit circle.
Let now D² + E² ≥ 1, so the roots are real. Let us show that (2.11) has both
roots within the interval [−1, 1]. Indeed, p(z) = (1 + E)z² − 2Dz + (1 − E) > 0 at
1 and −1 because |D| < 1. Also, p(z) reaches its minimum at z0 = D/(1 + E).
Since |D| < 1 and E > 0, |z0| < 1. However, the value p(z0) = (1 − E² − D²)/(1 + E) ≤ 0. So,
it is again the equation (2.12) that has the roots bigger than one in absolute value.
(If B = π/4 and E = 1 , then the equation (2.12) degenerates into the first order
equation. However, its root 1/D is still bigger than one in absolute value, and both
of the roots of (2.11) (which are 0 and D) are inside [−1, 1].)
From here, it follows that z1 z2 = (1 + E)/(1 − E), z1 + z2 = 2D/(1 − E) and
therefore
α(z) = a2(z² − (2D/(1 − E)) z + (1 + E)/(1 − E))
It remains to find the quotient a2 /b2 . As we can see from (2.9), T (ω) = 1
if ω = ωmax = arccos(D). So, we should have T̃ (z) = 1 if z = zmax = eiωmax .
However,
zmax = cos(ωmax) + i sin(ωmax) = D + i √(1 − D²)
(since 0 < ωmax < π, sin(ωmax) > 0). Hence,
zmax² = 2D² − 1 + 2iD √(1 − D²)
and
α(zmax) = a2(2D² − 1 + 2iD √(1 − D²) − (2D/(1 − E))(D + i √(1 − D²)) + (1 + E)/(1 − E))
= a2 (E/(E − 1)) (2(D² − 1) + 2iD √(1 − D²))
= 2a2 (E/(E − 1)) ((D² − 1) + iD √(1 − D²))
In a similar way,
β(zmax) = b2(2D² − 2 + 2iD √(1 − D²)) = 2b2((D² − 1) + iD √(1 − D²))
and therefore
β(zmax)/α(zmax) = (b2/a2) (E − 1)/E
For instance, we can take b2 = E and a2 = E − 1. Hence, the formula for the filter
becomes
(2.13) (E + 1)Yt − 2DYt−1 + (1 − E)Yt−2 = EXt − EXt−2 .
Second order band reject tangent filter. With band pass filter done, band
reject filter is easy. Since the transfer function of the band reject filter is one minus
the transfer function of the band pass filter, we immediately get from (2.10)
T̃(z) = 1 − 1/(1 − (z² − 2Dz + 1)²/(E²(z² − 1)²))
Moreover, T̃ (z) has the same poles as the function (2.10) and, as above,
α(z) = a2(z² − (2D/(1 − E)) z + (1 + E)/(1 − E)).
Next, T̃(z) = 0 if and only if z² − 2Dz + 1 = 0, that is, if z = D ± i √(1 − D²) (and
each of the roots actually has multiplicity two). We need to choose two of them
and, for the first time, it looks like we have some freedom. However, the only way
to get a filter with coefficients that are real numbers is to take w1 = D + i √(1 − D²)
and w2 = D − i √(1 − D²). Therefore
β(z) = b2(z² − 2Dz + 1)
The quotient a2 /b2 could be found from the condition T (0) = T (π) = 1, which is
equivalent to
α(1) = β(1), α(−1) = β(−1)
However,
α(1) = 2a2(1 − D)/(1 − E),   β(1) = 2b2(1 − D),
α(−1) = 2a2(1 + D)/(1 − E),   β(−1) = 2b2(1 + D),
and we can take a2 = 1 − E, b2 = 1. Hence, the formula for the filter becomes
(2.14) (E + 1)Yt − 2DYt−1 + (1 − E)Yt−2 = Xt − 2DXt−1 + Xt−2 .
Finally, band pass and band reject filters were constructed under the assumption
B ≠ π/4, which is equivalent to E ≠ 1. However, both formulas (2.13) and
(2.14) work in that case as well (though α(z) becomes a first order polynomial).
Example. Suppose the center of the band is ωc = 5π/12 and the bandwidth is π/5,
that is, B = π/10. Then D = cos(5π/12) sec(π/10) ≈ 0.272 and E = tan(π/10) ≈ 0.325.
For the band pass filter, (2.13) becomes
1.325Yt − 0.544Yt−1 + 0.675Yt−2 = 0.325(Xt − Xt−2).
For the band reject filter, (2.14) becomes
1.325Yt − 0.544Yt−1 + 0.675Yt−2 = Xt − 0.544Xt−1 + Xt−2.
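This example can be checked directly from (2.13) and (2.14); a minimal sketch (the variable names are mine):

```python
import numpy as np

omega_c, B = 5 * np.pi / 12, np.pi / 10      # center of the band, half of the bandwidth
D = np.cos(omega_c) / np.cos(B)              # D = cos(omega_c) sec(B)
E = np.tan(B)

# (2.13), band pass: (E+1) Y_t - 2D Y_{t-1} + (1-E) Y_{t-2} = E X_t - E X_{t-2}
print(np.round([E + 1, -2 * D, 1 - E, E], 3))   # 1.325, -0.544, 0.675, 0.325
# (2.14), band reject: same left side, right side X_t - 2D X_{t-1} + X_{t-2}
print(np.round(-2 * D, 3))                      # middle coefficient -0.544
```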
Remark. Sometimes, the parametrization by the center of the band and the
bandwidth is not convenient. For instance, we may want to design, say, a band
reject filter with given bandwidth in such a way that T (ω0 ) = 0 for some specific
frequency ω0 , so that this frequency will be completely eliminated. If this is the
case, the answer could be found from the same equations (2.13) or (2.14) if we set
D = cos(ω0 ), E = tan(B).
Example. The data shown on Figure 13 contains a strong periodic component
with frequency ω0 ≈ 1.69163. However, we suspect that the data may
also contain some other signals. Using a low pass or high pass filter will not solve
the problem: the oscillation is so strong that it will make it through unless T(ω0) is
practically zero. So, we would like to create a band reject filter such that its transfer
function vanishes at ω0. The parameter B is at our disposal; we have chosen
B = π/10. We get
D = cos(ω0) = −0.120537,   E = tan(B) = 0.32492
and the formula (2.14) becomes
(1 + 0.181953B + 0.509525B²)Yt = (0.754763 + 0.181953B + 0.754763B²)Xt.
The results of the filtration could be seen on Figure 14. Indeed, the data contains a
dash-dot signal. Just for comparison, Figure 15 shows what happens if we apply a
low pass filter with cut-off point π/20 instead.
Exercises
1. Compute the coefficients of a 2nd order low pass tangent filter with the
cut-off point ω0 = π/30, and apply it to the data (will be posted on the web).
Figure 15. 2nd order low pass filter with cut-off point π/20, ap-
plied to the same data set.
2. Compute the coefficients of a 2nd order high pass tangent filter with the
cut-off point ω0 = 19π/20, and apply it to the data (will be posted on the web).
3. Compute the coefficients of a 2nd order band pass tangent filter with cut-off
points 5π/6 ± π/20 and apply it to the data (will be posted on the web).
4. Compute the coefficients of a 2nd order band reject tangent filter with cut-off
points π/3 ± π/10 and apply it to the data (will be posted on the web).
6*. Let T (ω) be given by the formula (2.7). Show that T (ω) → 1 if |ω−ωc | < B
and T (ω) → 0 if |ω − ωc | > B as n → ∞.
7*. Following the procedure described above, construct a 3rd order low pass
tangent filter with cut-off point ω0 .
8*. The data (will be posted on the web) contains a signal, some noise and
two strong periodic components, one with frequency π/7 and the other one with
frequency π/8. In order to decode the signal, design a second order band-reject
tangent filter (use B = π/15) that would completely eliminate the frequency π/8
and apply it to the data. Next, design a band reject filter (again, use B = π/15)
that would completely eliminate the frequency π/7 and apply it to the result of the
first filtration. Finally, apply the second order low pass tangent filter with cut-off
point π/15 to the result of the second filtration. Graph the results.
Project
A data set contains dash-dot signals hidden in the noise. Decode the signals!
(data set will be posted on the web).
Figure 16. Low pass tangent filter of the order 2, applied in for-
ward and backward direction.
Another solution (an “exact” one) is to use so-called FIR (finite impulse response)
filters. We are looking for a filter of the form
(3.2) Yt = Σ_{k=−M}^{M} bk Xt−k
Figure 21. Potter low pass filter with M = 75 and cut-off point
π/30, applied to the data shown on Figure 8. Compare it with the
other (tangent) filters, especially with the one shown on Figure 16.
Finally,
(3.4) Yt = Σ_{k=−M}^{M} wk bk Xt−k
Seasonality
(1.1) Xt = a + bt + St + εt
where Tt = a + bt stands for the trend (we could use another model for the trend,
say parabolic or exponential or any other), and St stands for the seasonal component,
which we assume to be periodic with period 12. St is usually called the seasonal
index (Sj can be interpreted as an adjustment for the month j). We normally
assume that
(1.2) S1 + · · · + S12 = 0
(we can always achieve this by modifying the parameter a). We treat a, b and the
values S1, . . . , S11 as unknown parameters, and we could use least squares in order
to estimate them.
In fact, this particular model fits exactly into the multiple linear regression
scheme. We define extra variables Xt(1), . . . , Xt(12) as follows: Xt(1) = 1 if t corresponds
to January, otherwise Xt(1) = 0, and so on. Then the model (1.1) can be
rewritten as
Xt = bt + Σ_{j=1}^{12} Sj Xt(j) + εt.
After applying the least squares, we can set a = (S1 + · · · + S12)/12 and subtract
a from every Sj, in order to satisfy (1.2).
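The least squares fit can be sketched with simulated data (the data generation and all names are mine; in practice Xt is the observed series):

```python
import numpy as np

rng = np.random.default_rng(1)
T = 144                                    # twelve years of monthly data
t = np.arange(T)
season = 10 * np.sin(2 * np.pi * t / 12)   # a "true" seasonal component, period 12
x = 0.5 * t + season + rng.normal(0.0, 1.0, T)

# Design matrix: the time trend plus one 0/1 dummy variable per month
M = np.zeros((T, 13))
M[:, 0] = t
M[np.arange(T), 1 + t % 12] = 1.0
coef, *_ = np.linalg.lstsq(M, x, rcond=None)

b, S = coef[0], coef[1:]
a = S.mean()                # a = (S_1 + ... + S_12)/12
S = S - a                   # now S_1 + ... + S_12 = 0, as in (1.2)
print(round(b, 2))          # estimated slope, close to the true 0.5
```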
If the magnitude of the seasonal effect clearly looks proportional to the data
(the more common case), we have a choice between a preliminary logarithmic
transformation with the additive model (1.1), or a multiplicative model given by the equation
(1.3) Xt = (a + bt)St + εt
where Tt = a + bt is the (linear) trend, and St is the multiplicative seasonal index.
In this case, Sj should be positive, and the condition (1.2) makes no sense. It is
commonly replaced either by the condition
(S1 + · · · + S12)/12 = 1
or by the condition
S1 · S2 · · · S12 = 1.
However, estimation of the parameters of the multiplicative model is no longer a
linear regression.
2. Seasonal Exponential Smoothing. The second possibility is to adopt the
ideas of exponential smoothing (the so-called Holt-Winters model). It is designed
to handle the case when the coefficients of the previous model (trend coefficients
and seasonal indices) fluctuate in time. We assume that the series can be locally
approximated by the multiplicative model (1.3). Each time, we compare the one
step ahead forecast with the actual data, and adjust the coefficients accordingly.
Ordinary (non-seasonal) exponential smoothing depends on the smoothing
parameter α, which determines the sensitivity of the model (how strongly the
model reacts to fluctuations). In order to make the seasonal version of the method
more flexible, three smoothing parameters α, γ, δ are introduced: one of them
controls the adjustment of the current level a, the second controls the slope b and
the last controls the adjustment of the seasonal index S. To be precise, we assume
that the k steps ahead forecast, constructed at time n, is given by the formula
X̂n+k (n) = (an + bn k)Sn−12+k
where at , bt , St are the coefficients of the model used at time t. The adjustment
equations are as follows:
an = α (Xn/Sn−12) + (1 − α)(an−1 + bn−1)
bn = γ(an − an−1) + (1 − γ)bn−1
Sn = δ (Xn/an) + (1 − δ)Sn−12
As we can see, if all the smoothing constants are (practically) zeroes, then no
adjustment occurs (except the obvious an = an−1 + bn−1 due to the change of the
origin from n − 1 to n).
An additive version of this model is called the Theil-Wage model. This time, we
assume that the data can be locally approximated by additive model (1.1). Hence,
the k steps ahead forecast, constructed at time n, equals
X̂n+k (n) = an + bn k + Sn−12+k .
As above, we have three smoothing parameters α, γ, δ, one of them controls the
adjustment of the current level a, the second controls the slope b and the last
controls the adjustment of the seasonal index S. The corresponding adjustment
equations are
an = α(Xn − Sn−12 ) + (1 − α)(an−1 + bn−1 )
bn = γ(an − an−1 ) + (1 − γ)bn−1
Sn = δ(Xn − an ) + (1 − δ)Sn−12 .
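One step of these recursions can be sketched as follows (the function name, data and initial values are mine):

```python
def theil_wage_step(x_n, a_prev, b_prev, s_prev12, alpha, gamma, delta):
    """One step of the additive (Theil-Wage) adjustment equations."""
    a_n = alpha * (x_n - s_prev12) + (1 - alpha) * (a_prev + b_prev)
    b_n = gamma * (a_n - a_prev) + (1 - gamma) * b_prev
    s_n = delta * (x_n - a_n) + (1 - delta) * s_prev12
    return a_n, b_n, s_n

# With all smoothing constants equal to zero, only the change of origin remains:
a, b, s = theil_wage_step(123.0, a_prev=100.0, b_prev=2.0, s_prev12=5.0,
                          alpha=0.0, gamma=0.0, delta=0.0)
print(a, b, s)   # 102.0 2.0 5.0 -- a_n = a_{n-1} + b_{n-1}, nothing else adjusts
```

In a full implementation this step runs once per observation, keeping the last twelve seasonal indices, and one then minimizes the average one step ahead error over α, γ, δ.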
Practical recommendation: try to minimize the average one step ahead prediction
error as a function of the smoothing parameters. If any of the optimal smoothing
parameters (especially α) turns out to be greater than 0.25, the model is not working
properly.
3. A seasonal version of the Box-Jenkins (ARIMA) model. Suppose a
series Xt has the structure (1.1). Consider a new series
Yt = Xt − Xt−12 = (1 − B^12)Xt
(Recall that the operator ∇12 = 1 − B^12 is called the seasonal difference). As we
can see,
Yt = 12b + εt − εt−12
is a stationary series. Moreover, it is a moving average process of the order 12
(though it is not invertible). If the trend is not linear, we may have to take another
difference and consider
Zt = ∇∇12 Xt
In our case, Zt = εt − εt−1 − εt−12 + εt−13 . However, it is stationary (and even with
zero expectation). Based on this observation, the following procedure has been
suggested.
a. Make sure the seasonal component is not multiplicative (take logarithm if
necessary).
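The effect of the seasonal difference on model (1.1) can be verified numerically (simulated data; the parameters are mine):

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.arange(240)
S = rng.normal(0.0, 5.0, 12)
S = np.tile(S - S.mean(), 20)            # seasonal indices with period 12, summing to zero
x = 3.0 + 0.25 * t + S + rng.normal(0.0, 1.0, 240)   # model (1.1) with b = 0.25

y = x[12:] - x[:-12]                     # Y_t = X_t - X_{t-12} = 12 b + eps_t - eps_{t-12}
print(round(float(y.mean()), 2))         # close to 12 b = 3
```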
2. Examples
In all the examples below, we set aside the last twelve points (one year). We
construct a prediction using one of the methods described above, and compare it
with the actual data.
1. As a first example, consider monthly sales of a dry cleaning business (Figure
5): monthly data, 12 years, 144 points. The seasonal effects definitely
look proportional to the current level, so we could try a linear trend model with
multiplicative seasonality. Our second option is to take a logarithm and then try
a linear or quadratic trend with additive seasonality. On Figures 6 and 7, you can
see a prediction made at the end of the eleventh year, compared with the actual data.
As we can see from the graphs, the multiplicative model is not working properly (one
third of the points are outside the confidence region).
We could also use a seasonal version of exponential smoothing. For the
multiplicative model (Holt-Winters), optimization of the smoothing constants gives
α = 0.14, γ = 0.067 and δ = 0.415. The second option is to take a logarithm and apply
the additive version of the model (Theil-Wage). Optimization of the parameters gives
us α = 0.232, γ = 0.085 and δ = 0.506. In both cases, the parameter δ (the one
responsible for the adjustment of the seasonal coefficients) looks a bit large. However,
as you can see on the graphs (Figures 8 and 9), the prediction looks very good.
Finally, we could take a logarithm and apply a seasonal version of the Box-
Jenkins (ARIMA) model. We take simple and seasonal differences in order to make
the series stationary. Models of different structure look good here; in particular,
the following one:
(1 + 0.429B)(1 − 0.29B^12)∇∇12 Xt = (1 − 0.9941B^12)εt
(see Figure 10). However, the estimated value of the seasonal moving average parameter
is very close to −1, which makes the model very close to a non-invertible one.
Because of that, trend models look more reasonable.
2. Another data set (Figure 11), also 12 years, 144 points, represents the number
of air passengers. Seasonal effects are not so stable here, since some holidays
move around (for instance, Easter). Once again, we may begin with trend models
(the multiplicative one looks more reasonable, see Figure 12). Exponential smoothing does
not seem to work here, because the optimal value of one of the smoothing parameters
turns out to be 0.95. ARIMA models seem very good. Again, various models could
be tried, in particular, this one:
$$(1 + 0.06B^{12})X_t = (1 - 0.35B)(1 - 0.59B^{12})\varepsilon_t$$
(see Figure 13). Here, the p-values for all estimated parameters are practically zero
(so all of them are significant), and the p-value of the Portmanteau test for the
residuals is 0.8163 (so the residuals are white noise).
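The Portmanteau test quoted here can be sketched in a few lines. The Ljung-Box form of the statistic is used below (one common variant), and the residual series is a random stand-in, since the actual model residuals are not reproduced in the text.

```python
import random

def acf(x, k):
    """Sample autocorrelation of x at lag k (mean subtracted)."""
    n = len(x)
    m = sum(x) / n
    c0 = sum((v - m) ** 2 for v in x)
    ck = sum((x[t] - m) * (x[t + k] - m) for t in range(n - k))
    return ck / c0

def ljung_box(x, lags):
    """Ljung-Box statistic Q = N(N+2) * sum_k acf(k)^2 / (N - k).
    Under the white-noise hypothesis Q is approximately chi-square(lags)."""
    n = len(x)
    return n * (n + 2) * sum(acf(x, k) ** 2 / (n - k) for k in range(1, lags + 1))

random.seed(0)
residuals = [random.gauss(0.0, 1.0) for _ in range(200)]  # stand-in residuals
q = ljung_box(residuals, lags=20)
```

For genuine white noise, Q stays close to the number of lags; a large Q (small p-value) would indicate remaining correlation in the residuals.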
3. Our next example is women's unemployment data (in the UK? 67 points, 5.5
years, since January 1967, Figure 14). Seasonal effects look proportional to the
level here, so we can either try a trend model with multiplicative seasonality, or
an ARIMA model such as this one:
$$(1 - 0.3487B)\nabla\nabla_{12} X_t = \varepsilon_t$$
4. Temperature data (since January 1960, 20 years). As we can see, there is
no trend. Hence, there is no reason to consider logarithms or a multiplicative trend
model. However, we may still consider a model
$$X_t = a + S_t + \varepsilon_t$$
where a is a constant and $S_t$ stands for the seasonal component (see Figure 18). Equally,
we may consider an ARIMA-type model; the following one looks alright:
$$(1 - 0.302B)(1 + 0.292B^{12})\nabla_{12} X_t = (1 - 0.871B^{12})\varepsilon_t$$
(see Figure 19). Note that, since there is no trend at all, we only take the seasonal
difference. Exponential smoothing (Theil-Wage) works here as well; the corresponding
smoothing parameters are α = 0.137, γ = 0.065 and δ = 0.214. However, since the series
contains no trend, the model should be modified: there should be no γ at all.
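A minimal sketch of that modified smoother, level plus additive seasonal component with the trend term dropped, might look as follows. The update equations follow the standard additive Holt-Winters form with the trend equation removed, and the series and smoothing constants are made up for illustration.

```python
def seasonal_smooth(x, alpha, delta, period=12):
    """Additive seasonal exponential smoothing with the trend term dropped:
       level_t = alpha * (x_t - s_{t-period}) + (1 - alpha) * level_{t-1}
       s_t     = delta * (x_t - level_t)     + (1 - delta) * s_{t-period}
    Returns the one-step-ahead predictions level_{t-1} + s_{t-period}."""
    level = x[0]
    seasonal = [0.0] * period
    preds = []
    for t, value in enumerate(x):
        s_old = seasonal[t % period]
        preds.append(level + s_old)          # forecast made before seeing x_t
        level = alpha * (value - s_old) + (1 - alpha) * level
        seasonal[t % period] = delta * (value - level) + (1 - delta) * s_old
    return preds

# Made-up series: constant level 5 plus an exact period-12 seasonal pattern.
pattern = [1.0, 0.5, -0.5, -1.0, 0.0, 0.5, 1.0, 0.5, -0.5, -1.0, -0.5, 0.0]
data = [5.0 + pattern[t % 12] for t in range(240)]
preds = seasonal_smooth(data, alpha=0.137, delta=0.214)
late_errors = [abs(p - v) for p, v in zip(preds[-24:], data[-24:])]
```

Note that the split into level and seasonal components is not unique (a constant can be traded between them), but the one-step predictions still converge on a noise-free seasonal series.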
5. Residential electricity (since January 1971, 9 years). The data is shown in
Figure 21. We can see some trend here, though it is probably not linear (changes
in population?). Also, we expect seasonal effects to be multiplicative rather than
additive (proportional to the population size?). The estimated model is shown in Figure 22.
Next, we can try an ARIMA-type model. Since we expect the seasonal effects to be
multiplicative, we take the logarithm first. The following model looks good enough:
$$(1 - 0.328B)(1 + 0.624B^{12})\nabla_{12} \log X_t = \varepsilon_t$$
(see Figure 23). Exponential smoothing (Holt-Winters) works somewhat as well (Figure
24); the optimal smoothing parameters are α = 0.18, γ = 0.035 and δ = 0.552, and
the parameter δ (responsible for the adjustment of the seasonal coefficients) is way too
big.
6. Our last example is the car registration data briefly discussed above (22
years, since January 1947, see Figure 1). The series definitely contains a trend.
However, if we look at the graph, we realize that no parametric model could possibly
work here. Our only hope is a seasonal ARIMA model, and indeed, the following
one looks alright:
$$(1 + 0.216B + 0.29B^2)(1 + 0.593B^{12} + 0.402B^{24} + 0.254B^{36})\nabla\nabla_{12} X_t = \varepsilon_t$$
(see Figure 25). Exponential smoothing is another option (Figure 26); however,
the corresponding values of the smoothing parameters are α = 0.54, γ = 0.006, δ = 0.5,
and two of them are too big.
Exercises
1. Let $X_t = (a + bt)S_t + \varepsilon_t$, where $S_t = S_{t+12}$ for all t. Does the transformation
$\nabla_{12} X_t = X_t - X_{t-12}$ make the series stationary? If not, find one that does.
Multivariate models
2. Suppose $\varepsilon_t$ and $\eta_t$ are two independent white noises with expectation zero
and variances $\sigma_\varepsilon^2$ and $\sigma_\eta^2$. Let $X_t$ and $Y_t$ be given by the formulas
$$X_t = \varepsilon_t + b\varepsilon_{t-1} + \beta\eta_{t-1}, \qquad Y_t = \eta_t + c\eta_{t-1} + \gamma\varepsilon_{t-1}$$
Then
$$\operatorname{Var}(X_t) = (1 + b^2)\sigma_\varepsilon^2 + \beta^2\sigma_\eta^2, \qquad \operatorname{Var}(Y_t) = (1 + c^2)\sigma_\eta^2 + \gamma^2\sigma_\varepsilon^2,$$
and
$$R_X(\pm 1) = b\sigma_\varepsilon^2, \qquad R_Y(\pm 1) = c\sigma_\eta^2.$$
$$(1.2)\qquad \hat\rho_{XY}(k) = \frac{\hat C_{XY}(k)}{\sqrt{\hat R_X(0)\,\hat R_Y(0)}}$$
Remark. It could be shown that ĈXY and ρ̂XY are consistent, asymptotically
normal and asymptotically unbiased.
Suppose both series have expectation zero (known to us). Then the estimate
(1.1) takes the form
$$(1.3)\qquad \hat C_{XY}(k) = \frac{1}{N}\sum_{t=1}^{N-k} X_t Y_{t+k}, \qquad \hat C_{XY}(-k) = \frac{1}{N}\sum_{t=k+1}^{N} X_t Y_{t-k},$$
and its expectation is equal to $E\hat C_{XY}(k) = \frac{N - |k|}{N}\, C_{XY}(k)$.
Suppose now that the series X and Y are, in fact, independent. Then $E\hat C_{XY}(k) = 0$ and
$$\operatorname{Var}\hat C_{XY}(k) = E(\hat C_{XY}(k))^2 = \frac{1}{N^2}\sum_{s=1}^{N-k}\sum_{t=1}^{N-k} E(X_t Y_{t+k} X_s Y_{s+k})$$
$$= \frac{1}{N^2}\sum_{s=1}^{N-k}\sum_{t=1}^{N-k} R_X(t-s)R_Y(t-s) = \frac{1}{N^2}\sum_{u=-(N-k)+1}^{N-k-1} (N-k-|u|)\, R_X(u)R_Y(u)$$
For small k,
$$\operatorname{Var}(\hat C_{XY}(k)) \approx \frac{1}{N^2}\sum_{u=-N+1}^{N-1} (N - |u|)\, R_X(u)R_Y(u)$$
$$= \frac{1}{N}\Bigl(R_X(0)R_Y(0) + 2\frac{N-1}{N} R_X(1)R_Y(1) + \cdots + \frac{2}{N} R_X(N-1)R_Y(N-1)\Bigr)$$
$$= \frac{\sigma_X^2\sigma_Y^2}{N}\Bigl(1 + 2\frac{N-1}{N}\rho_X(1)\rho_Y(1) + \cdots + \frac{2}{N}\rho_X(N-1)\rho_Y(N-1)\Bigr)$$
Finally,
$$(1.4)\qquad \operatorname{Var}\hat\rho_{XY}(k) \approx \frac{1}{N}\Bigl(1 + 2\frac{N-1}{N}\rho_X(1)\rho_Y(1) + \cdots + \frac{2}{N}\rho_X(N-1)\rho_Y(N-1)\Bigr)$$
Suppose now that one of the series is actually a white noise. Then $\operatorname{Var}\hat\rho_{XY}(k) \approx 1/N$ and
$$P\Bigl\{|\hat\rho_{XY}(k)| \le \frac{2}{\sqrt N}\Bigr\} \approx 0.95.$$
However, if both series have non-trivial correlation functions, then the confidence
interval may be much bigger. The following procedure allows us to get around this
problem.
Step 1. (Pre-whitening) Construct a transformation
$$\tilde X_t = \alpha(B) X_t$$
such that $\tilde X_t$ is (approximately) a white noise (for instance, fit an AR model of
a reasonable order).
Step 2. If Xt and Yt are independent, then X̃t and Yt are still independent,
and we can use the confidence intervals constructed above.
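The two steps can be sketched as follows in stdlib Python, with synthetic data; an AR(1) fit via the lag-1 sample autocorrelation stands in for "an AR model of a reasonable order".

```python
import random

def acf1(x):
    """Lag-1 sample autocorrelation (mean subtracted)."""
    n = len(x)
    m = sum(x) / n
    c0 = sum((v - m) ** 2 for v in x)
    return sum((x[t] - m) * (x[t + 1] - m) for t in range(n - 1)) / c0

def prewhiten(x):
    """Step 1: fit AR(1) by Yule-Walker (phi = lag-1 autocorrelation) and
    return the residuals x~_t = x_t - phi * x_{t-1}, roughly white noise."""
    phi = acf1(x)
    return [x[t] - phi * x[t - 1] for t in range(1, len(x))]

# Synthetic AR(1) series with phi = 0.8.
random.seed(1)
x = [0.0]
for _ in range(1500):
    x.append(0.8 * x[-1] + random.gauss(0.0, 1.0))
x = x[500:]             # drop the burn-in

x_tilde = prewhiten(x)  # Step 2 would cross-correlate x_tilde with Y
```

The strongly autocorrelated input comes out with nearly zero lag-1 autocorrelation, so the 2/√N confidence bands become applicable.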
Cross-spectrum. Suppose $C_{XY}(k) \to 0$ as $k \to \pm\infty$, so that the series
$\sum_{k=-\infty}^{\infty} |C_{XY}(k)|$ converges. The cross-spectral density of X and Y is defined by the
formula
$$f_{XY}(\omega) = \frac{1}{2\pi}\sum_{k=-\infty}^{\infty} C_{XY}(k) e^{-i\omega k}, \qquad -\pi \le \omega \le \pi$$
Since $C_{XY}(k)$ is not even, the cross-spectral density is no longer a real-valued
function. However, we still have
$$C_{XY}(k) = \int_{-\pi}^{\pi} e^{i\omega k} f_{XY}(\omega)\, d\omega$$
Estimation of the cross-spectrum. Everything that was said about the estimation
of the spectral density applies here as well. Once again, we either use a lag
window,
$$\hat f_{XY}(\omega) = \frac{1}{2\pi}\sum_{k=-N+1}^{N-1} \lambda_k \hat C_{XY}(k) e^{-i\omega k}, \qquad \lambda_k = \lambda\Bigl(\frac{k}{M}\Bigr),$$
or a spectral window,
$$\hat f_{XY}(\omega) = \int_{-\pi}^{\pi} I_{XY}(\lambda)\, k(\omega - \lambda)\, d\lambda, \qquad k(\omega) = M K(M\omega),$$
where
$$I_{XY}(\omega) = \frac{1}{2\pi N}\Bigl(\sum_t X_t e^{i\omega t}\Bigr)\Bigl(\sum_t Y_t e^{-i\omega t}\Bigr)$$
is the cross-periodogram.
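The cross-periodogram formula can be transcribed directly. As a sanity check, taking Y = X reduces it to the ordinary periodogram, which must be real and non-negative; the test series below is made up.

```python
import cmath
import math

def cross_periodogram(x, y, omega):
    """I_XY(w) = (1 / (2 pi N)) (sum_t x_t e^{i w t}) (sum_t y_t e^{-i w t})."""
    n = len(x)
    sx = sum(x[t] * cmath.exp(1j * omega * t) for t in range(n))
    sy = sum(y[t] * cmath.exp(-1j * omega * t) for t in range(n))
    return sx * sy / (2.0 * math.pi * n)

# With y = x this is the ordinary periodogram: real and non-negative.
x = [math.sin(0.3 * t) + 0.1 * t for t in range(100)]
value = cross_periodogram(x, x, omega=0.3)
```

For two different real series the value is genuinely complex, which is why the cross-spectrum is usually reported through its modulus, phase, and coherency, as in the figures below.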
To illustrate the concept, we consider the electromagnetic activity of a human
brain.
Figure 2. Power cross-spectrum for the left and the right hemispheres. The highest of the sharp peaks is a resonance frequency of the meter; the second one is the frequency of the electric power generator.
Figure 3. Phase of the cross-spectrum for the left and the right
hemisphere. Kind of hard to interpret.
Figure 4. Coherency function for the left and the right hemisphere.
and therefore
$$P_{0|0} = P_0.$$
First step. Suppose $X_{t-1|t-1}$ and $P_{t-1|t-1}$ are known. By the transition equation,
$$(2.4)\qquad X_t = A_t X_{t-1} + e_t + a_t$$
Taking the conditional expectation with respect to $Y_{\le t-1}$, we therefore get
$$(2.5)\qquad X_{t|t-1} = A_t X_{t-1|t-1} + a_t$$
($e_t$ disappeared because $Y_{\le t-1}$ depends on $X_0, e_0, \dots, e_{t-1}, f_1, \dots, f_{t-1}$ and is
independent of $e_t$). Subtracting (2.5) from (2.4), we get an expression for the
prediction error
$$X_t - X_{t|t-1} = A_t (X_{t-1} - X_{t-1|t-1}) + e_t$$
($a_t$ cancels out). Therefore
$$(2.6)\qquad P_{t|t-1} = E(X_t - X_{t|t-1})(X_t - X_{t|t-1})^T = A_t E(X_{t-1} - X_{t-1|t-1})(X_{t-1} - X_{t-1|t-1})^T A_t^T + E e_t e_t^T = A_t P_{t-1|t-1} A_t^T + Q_t$$
Second step. Suppose now that $X_{t|t-1}$ and $P_{t|t-1}$ are known, and $Y_t$ has arrived.
We have
$$Y_t = H_t X_t + f_t$$
and therefore
$$Y_{t|t-1} = H_t X_{t|t-1},$$
where $Y_{t|t-1} = E(Y_t | Y_{\le t-1})$ is the prediction for $Y_t$ given $Y_{\le t-1}$. Once again, $f_t$
disappears because it is independent of $Y_{\le t-1}$. Denote by
$$Z_t = Y_t - Y_{t|t-1} = H_t (X_t - X_{t|t-1}) + f_t$$
the prediction error, and by $F_t$ the covariance matrix of $Z_t$. It is equal to
$$(2.7)\qquad F_t = E Z_t Z_t^T = E(Y_t - Y_{t|t-1})(Y_t - Y_{t|t-1})^T = H_t P_{t|t-1} H_t^T + R_t$$
It is easy to see that $Z_t$ is uncorrelated with $Y_{\le t-1}$ (and therefore independent of
it). In a sense, $Z_t$ represents the new information contained in $Y_t$. Using
this property, one can show that the best predictor for $X_t$ given $Y_{\le t}$ is equal to
$$X_{t|t} = X_{t|t-1} + E(X_t | Z_t)$$
Since everything is Gaussian, the conditional expectation $E(X_t | Z_t)$ can be represented
in terms of the covariance matrix of $X_t$ and $Z_t$, and the covariance matrix
of $Z_t$, which is $F_t$. Namely,
$$E(X_t | Z_t) = E(X_t Z_t^T) F_t^{-1} Z_t$$
where
$$E(X_t Z_t^T) = E X_t (X_t - X_{t|t-1})^T H_t^T + E X_t f_t^T = E(X_t - X_{t|t-1})(X_t - X_{t|t-1})^T H_t^T + E X_{t|t-1}(X_t - X_{t|t-1})^T H_t^T + E X_t f_t^T = P_{t|t-1} H_t^T$$
All the other terms disappear because $f_t$ is independent of $X_t$, and $(X_t - X_{t|t-1})$
is independent of $Y_{\le t-1}$ and therefore of $X_{t|t-1}$ as well. Finally, we get
$$(2.8)\qquad X_{t|t} = X_{t|t-1} + P_{t|t-1} H_t^T F_t^{-1} (Y_t - H_t X_{t|t-1})$$
which implies
$$X_t - X_{t|t} = X_t - X_{t|t-1} - P_{t|t-1} H_t^T F_t^{-1} (Y_t - H_t X_{t|t-1})$$
and
$$(2.9)\qquad P_{t|t} = P_{t|t-1} - P_{t|t-1} H_t^T F_t^{-1} H_t P_{t|t-1}$$
Combining both steps together, we get
$$(2.10)\qquad X_{t+1|t} = A_{t+1} X_{t|t-1} + A_{t+1} P_{t|t-1} H_t^T F_t^{-1} (Y_t - H_t X_{t|t-1})$$
and
$$(2.11)\qquad P_{t+1|t} = A_{t+1} (P_{t|t-1} - P_{t|t-1} H_t^T F_t^{-1} H_t P_{t|t-1}) A_{t+1}^T + Q_{t+1}$$
(note that the matrix $F_t$, given by (2.7), depends on $P_{t|t-1}$). The matrix
$$K_t = A_{t+1} P_{t|t-1} H_t^T F_t^{-1}$$
from (2.10) is called the gain matrix of the Kalman filter.
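The two steps combine into a short recursion. The sketch below is a scalar special case (all matrices 1 by 1, constant A, H, Q, R; the observations are made up), not production filtering code.

```python
def kalman_filter(ys, a=1.0, h=1.0, q=0.0, r=1.0, x0=0.0, p0=10.0):
    """Scalar Kalman filter; returns the filtered estimates x_{t|t}
    and their variances P_{t|t}."""
    x, p = x0, p0
    estimates, variances = [], []
    for y in ys:
        # Prediction step, equations (2.5) and (2.6).
        x_pred = a * x
        p_pred = a * p * a + q
        # Update step, equations (2.7)-(2.9).
        f = h * p_pred * h + r              # innovation variance F_t
        gain = p_pred * h / f               # P_{t|t-1} H^T F^{-1}
        x = x_pred + gain * (y - h * x_pred)
        p = p_pred - gain * h * p_pred
        estimates.append(x)
        variances.append(p)
    return estimates, variances

# Constant hidden state (a = 1, q = 0) observed with noise: the posterior
# variance must then shrink monotonically as observations accumulate.
obs = [5.1, 4.9, 5.0, 5.2, 4.8, 5.05, 4.95]
est, var = kalman_filter(obs)
```

With q = 0 each update multiplies P by r/(P + r) < 1, so the filter becomes increasingly confident, and the estimate settles near the average of the observations.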
APPENDIX A
Elements of Probability
1. Basic Concepts
A1.1. Sample space, Events, Probabilities. We begin with a concept of
a sample space, typically denoted by Ω. It is defined as a collection of all possible
outcomes. Elements of the sample space ω ∈ Ω are called outcomes or elementary
events. Subsets A ⊂ Ω of the sample space are called events (intuitively, an event
is a collection of outcomes with certain property).
Events $A_1, A_2, \dots$ are called mutually exclusive, or disjoint, if $A_i \cap A_j = \emptyset$ for
$i \ne j$.
To every event A, there corresponds its probability P (A). The probabilities
P (A) must satisfy the following properties (Axioms of Probability).
1. 0 ≤ P (A) ≤ 1.
2. P (Ω) = 1.
3. (Countable additivity, or σ-additivity.) If events $A_1, A_2, \dots$ are disjoint, then
$$P\Bigl(\bigcup_{n=1}^{\infty} A_n\Bigr) = \sum_{n=1}^{\infty} P(A_n)$$
Intuitively, P (A) is the frequency of the occurrence of the event A if we repeat the
same random experiment again and again. So, if P (A) = 0, then the event A is
impossible. If P (A) = 1, then A will definitely occur.
Further properties of probabilities. The axioms of probability imply:
1. P(∅) = 0.
2. (Finite additivity) If $A_1, \dots, A_n$ are mutually exclusive, then $P(\bigcup_{k=1}^{n} A_k) = \sum_{k=1}^{n} P(A_k)$.
3. If $A^c = \Omega \setminus A$ is the complement of A, then $P(A^c) = 1 - P(A)$.
4. (Monotonicity) If $A \subset B$, then $P(A) \le P(B)$.
5. (Continuity) (a) Let $A_n$ be a sequence of events such that $A_n \subset A_{n+1}$ for
all n. Then $P(\bigcup_{n=1}^{\infty} A_n) = \lim_{n\to\infty} P(A_n)$.
(b) Let $A_n$ be a sequence of events such that $A_n \supset A_{n+1}$ for all n. Then
$P(\bigcap_{n=1}^{\infty} A_n) = \lim_{n\to\infty} P(A_n)$.
See Problems 1-7 at the end of the section.
Conditional Probabilities. Suppose we know that the event B has occurred.
Can we say anything about another event A? To address this problem, we define the
conditional probability of A given B by the formula
$$P(A|B) = \frac{P(AB)}{P(B)}$$
(provided $P(B) \ne 0$). In a sense, B acts as a new sample space: only outcomes
$\omega \in B$ are possible. There are two major formulas related to this concept.
(again, we assume that the series converges absolutely). If the integral (or series)
does not converge, then the random variable does not have expectation (or, which
is the same, EX does not exist).
Properties of the expectation. The expectation of a random variable has the
following properties.
1. If c is a constant, then Ec = c.
2. Linearity: E(aX + bY) = aEX + bEY if both EX and EY exist. In
particular, E(aX + b) = aEX + b.
3. Monotonicity: If X ≥ Y, then EX ≥ EY.
4. If X ≥ 0 and EX = 0, then X = 0 with probability 1.
5. Useful formula. Suppose Y = g(X), where g is a real-valued function and X
has a continuous distribution with density $f_X$. Then
$$EY = Eg(X) = \int_{-\infty}^{\infty} g(t) f_X(t)\, dt$$
In contrast to variance, covariance can be of any sign. For instance, Cov(X, −X) =
− Var(X).
Correlation coefficient. It is defined by the formula
$$\operatorname{Corr}(X, Y) = \rho(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}}$$
The Cauchy-Schwarz inequality implies $-1 \le \operatorname{Corr}(X, Y) \le 1$. It can be shown that, if
$|\operatorname{Corr}(X, Y)| = 1$, then $Y = aX + b$ with probability one.
Variance-covariance matrix. Let $X_1, \dots, X_n$ be random variables, and let
$C_{ij} = \operatorname{Cov}(X_i, X_j)$. The square matrix $C = (C_{ij})$ is called the variance-covariance
(to verify that, note that the expression in question is actually equal to Var(a1 X1 +
· · · + an Xn ) and therefore must be non-negative.)
For the rest of the exposition, we focus on continuous random variables.
A1.7. Joint distributions. We say that random variables X1 , . . . , Xn have a
continuous joint distribution with the density f (x1 , . . . , xn ) = fX1 ,...,Xn (x1 , . . . , xn )
if
$$P\{a_1 \le X_1 \le b_1, \dots, a_n \le X_n \le b_n\} = \int_{a_1}^{b_1}\!\cdots\int_{a_n}^{b_n} f(x_1, \dots, x_n)\, dx_n \cdots dx_1$$
for every collection of real numbers $a_1 < b_1, \dots, a_n < b_n$. The function $f_{X_1,\dots,X_n}$ is
called the joint density of the random variables $X_1, \dots, X_n$.
The joint density f is non-negative and satisfies the condition
$$\int_{-\infty}^{\infty}\!\cdots\int_{-\infty}^{\infty} f(x_1, \dots, x_n)\, dx_n \cdots dx_1 = 1$$
Along the same lines, we can verify that E[f (X)Y |X = x] = f (x)E(Y |X = x),
which implies the last property. If, however, X and Y do not have a joint density,
those properties still hold though their verification is more involved.
In a similar way, we can define a conditional variance Var(Y |X) as a vari-
ance with respect to the conditional distribution. For the conditional variance, the
following identity holds:
(1.7) Var(Y ) = E(Var(Y |X)) + Var(E(Y |X))
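Identity (1.7) can be verified exactly on a small discrete example by enumeration. The toy joint distribution below is made up: X is a fair coin and Y = X + Z, with Z = ±1 fair and independent of X.

```python
from itertools import product

# Joint law of (X, Y): X in {0, 1} fair; Y = X + Z with Z in {-1, +1} fair.
outcomes = [(x, x + z, 0.25) for x, z in product((0, 1), (-1, 1))]

ey = sum(p * y for _, y, p in outcomes)
var_y = sum(p * (y - ey) ** 2 for _, y, p in outcomes)        # Var(Y)

e_cond_var = 0.0          # E(Var(Y|X))
cond_means = []           # pairs (E(Y|X=x), P{X=x})
for xv in (0, 1):
    cond = [(y, p) for x, y, p in outcomes if x == xv]
    w = sum(p for _, p in cond)
    m = sum(p * y for y, p in cond) / w
    v = sum(p * (y - m) ** 2 for y, p in cond) / w
    e_cond_var += w * v
    cond_means.append((m, w))

em = sum(w * m for m, w in cond_means)
var_cond_mean = sum(w * (m - em) ** 2 for m, w in cond_means)  # Var(E(Y|X))
```

Here Var(Y) = 1.25, E(Var(Y|X)) = 1 and Var(E(Y|X)) = 0.25, so the two sides of (1.7) match exactly.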
Example. The importance of conditional expectations can be seen from the
following problem. Suppose we want to predict Y given the value of X. In other
words, we'd like to choose a function h(x) such that h(X) is "close" to Y. To be
specific, we want the expectation
$$E(Y - h(X))^2$$
to be as small as possible. We claim that the best h(X) coincides with $g(X) = E(Y|X)$. Indeed,
$$E(Y - h(X))^2 = E\bigl((Y - g(X)) + (g(X) - h(X))\bigr)^2 = E(Y - g(X))^2 + E(g(X) - h(X))^2 + 2E\bigl[(g(X) - h(X))(Y - g(X))\bigr]$$
Now, we claim that the last term on the right is equal to zero. To verify that, we
use the properties of conditional expectation. We have
E[(g(X) − h(X))(Y − g(X))] = E[E[(g(X) − h(X))(Y − g(X))|X]]
= E[(g(X) − h(X))E[Y − g(X)|X]]
(we used the second property and after that, the third one). But,
E(Y − g(X)|X) = E(Y |X) − g(X) = g(X) − g(X) = 0.
So,
$$E(Y - h(X))^2 = E(Y - g(X))^2 + E(g(X) - h(X))^2,$$
where the first term does not depend on h and the second is non-negative. So, the
whole thing is minimal if h = g.
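The minimization claim is easy to confirm on a small discrete example (the joint law and the competing predictors below are made up): the conditional mean g(X) = E(Y|X) has the smallest mean square error.

```python
from itertools import product

# Made-up joint law: X fair coin, Y = 2 X + Z, Z = ±1 fair, independent of X.
outcomes = [(x, 2 * x + z, 0.25) for x, z in product((0, 1), (-1, 1))]

def mse(h):
    """E (Y - h(X))^2 under the toy joint distribution."""
    return sum(p * (y - h(x)) ** 2 for x, y, p in outcomes)

g = lambda x: 2 * x    # E(Y|X = x) = 2x here, since E Z = 0
competitors = [lambda x: x, lambda x: 2 * x + 0.5, lambda x: 1.0]
best = mse(g)          # equals E Z^2 = 1
```

Every competitor pays the extra term E(g(X) − h(X))², exactly as the decomposition above predicts.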
A1.11. Distribution of a sum of independent random variables. If
X, Y are independent random variables and if fX , fY are corresponding densities,
then the density of Z = X + Y could be found as a convolution of the densities fX
and fY :
$$(1.8)\qquad f_{X+Y}(z) = \int_{-\infty}^{\infty} f_X(x) f_Y(z - x)\, dx$$
Assume that the functions g1 , . . . , gn are differentiable and define a 1-1 transforma-
tion, that is (1.10) implies
X1 = h1 (Y1 , . . . , Yn )
(1.11) ...
Xn = hn (Y1 , . . . , Yn )
(this assumption is harmless to us because we will be primarily interested in case
when g1 , . . . , gn are linear). Let
$$H = (H_{ij}) = \Bigl(\frac{\partial h_i}{\partial y_j}\Bigr)$$
be the matrix of the partial derivatives of the inverse transformation. Then
$$(1.12)\qquad f_{Y_1 \dots Y_n}(y_1, \dots, y_n) = |J(y_1, \dots, y_n)|\, f_{X_1 \dots X_n}(h_1(y_1, \dots, y_n), \dots, h_n(y_1, \dots, y_n))$$
where J = det H, the determinant of H = (Hij ), is called the Jacobian of the
transformation.
A1.13. Moment generating function. For a random variable X, its mo-
ment generating function, or m.g.f. φ(t) is defined by the formula
$$(1.13)\qquad \phi(t) = \phi_X(t) = Ee^{tX}.$$
We have φ(0) = 1. The expectation EetX may fail to exist for some t 6= 0. However,
if it is defined in a neighborhood of 0, then EX k exists for all k = 1, 2, . . . and
$$(1.14)\qquad \phi^{(k)}(0) = EX^k.$$
The expectation EX k is called the k-th moment of random variable X. According
to (1.14), we can find all the moments by differentiating the moment generating
function at the origin, which explains the name.
Moment generating functions have many useful properties. Two of them are
of special importance. First of all, if moment generating functions of X and Y
coincide, then the distributions of X and Y coincide (in fact, it is enough to have
φX (t) = φY (t) in a neighborhood of zero). Second, if X and Y are independent,
then the m.g.f. of the sum X + Y is equal to the product of m.g.f.’s:
$$\phi_{X+Y}(t) = Ee^{t(X+Y)} = Ee^{tX}e^{tY} = Ee^{tX}\, Ee^{tY} = \phi_X(t)\phi_Y(t).$$
This property sometimes allows us to identify the distribution of the sum of two
independent random variables.
Finally, here is a useful scaling identity:
$$\phi_{aX+b}(t) = e^{tb}\phi_X(at).$$
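Relation (1.14) can be checked numerically for a standard normal variable, whose m.g.f. φ(t) = exp(t²/2) is known in closed form; finite differences at the origin recover EX = 0 and EX² = 1.

```python
import math

def mgf_std_normal(t):
    """Closed-form m.g.f. of N(0, 1): phi(t) = exp(t^2 / 2)."""
    return math.exp(t * t / 2.0)

def d1(f, h=1e-4):
    """Central finite-difference approximation of f'(0)."""
    return (f(h) - f(-h)) / (2.0 * h)

def d2(f, h=1e-4):
    """Central finite-difference approximation of f''(0)."""
    return (f(h) - 2.0 * f(0.0) + f(-h)) / (h * h)

m1 = d1(mgf_std_normal)   # approximates E X   = 0
m2 = d2(mgf_std_normal)   # approximates E X^2 = 1
```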
A1.14. Characteristic function and Laplace transform. Moment gen-
erating functions have but one drawback, they may be undefined for some t, even
for all t 6= 0 (and then, they are quite useless). However, moment generating func-
tion has a big brother — characteristic function. For a random variable X, its
characteristic function ψ(t) is defined by the formula
$$\psi(t) = \psi_X(t) = Ee^{itX}$$
where i stands for the imaginary unit. It is therefore a complex-valued function,
more specifically, it is equal to
$$\psi_X(t) = E\cos(tX) + iE\sin(tX).$$
Since sin and cos are bounded functions, the characteristic function is well defined
for all t. It could be shown that the characteristic function is continuous. Also,
ψ(0) = 1. Properties of characteristic functions are similar to those of moment
generating functions. In particular, if the characteristic functions of X and Y coincide,
then the distributions of X and Y coincide. Second, if
X and Y are independent, then the characteristic function of the sum X + Y is
equal to the product of characteristic functions:
$$\psi_{X+Y}(t) = Ee^{it(X+Y)} = Ee^{itX}e^{itY} = Ee^{itX}\, Ee^{itY} = \psi_X(t)\psi_Y(t).$$
Also,
$$\psi_{aX+b}(t) = e^{itb}\psi_X(at)$$
For non-negative random variables, one usually uses Laplace transforms. The
Laplace transform $\varphi_X(t)$ of a non-negative random variable X is defined by the
formula
$$\varphi(t) = \varphi_X(t) = Ee^{-tX}, \qquad t \ge 0.$$
So, the Laplace transform is, essentially, the moment generating function restricted
to negative arguments: ϕX (t) = φX (−t). Still, the Laplace transform is well de-
fined for all non-negative t, it determines the distribution uniquely and ϕX+Y (t) =
ϕX (t)ϕY (t) for independent X, Y . In contrast to the characteristic function, Laplace
transform is real-valued. On the other hand, the characteristic function is the
Fourier transform of the density of X and therefore the density could be found
through the inverse Fourier transform. This trick does not work with Laplace
transforms (in order to find the density of X via its Laplace transform, we have to
consider it for complex arguments t).
The integral can be evaluated only numerically. Tables of the standard normal
distribution can be found in practically any text in probability and statistics.
If X is normal with parameters m and σ 2 , then
$$P\{|X - m| < a\sigma\} = 2\Phi(a) - 1.$$
In particular,
$$P\{|X - m| < 1.96\sigma\} = 0.95, \qquad P\{|X - m| < 2.58\sigma\} = 0.99,$$
$$P\{|X - m| < 3\sigma\} = 0.9973, \qquad P\{|X - m| < 3.29\sigma\} = 0.999$$
A linear combination of independent normal random variables is again normal:
if $X_1$ and $X_2$ are independent and normally distributed with parameters $m_1, \sigma_1^2$
and $m_2, \sigma_2^2$, then $aX_1 + bX_2$ also has a normal distribution, with parameters
$am_1 + bm_2$, $a^2\sigma_1^2 + b^2\sigma_2^2$ (see Problem 33b).
The moment generating function and the characteristic function of a normal distribution
are given by the formulas
$$\phi_X(t) = e^{tm} e^{t^2\sigma^2/2}, \qquad \psi_X(t) = e^{itm} e^{-t^2\sigma^2/2}$$
where
$$C_{n_1, n_2} = n_1^{n_1/2} n_2^{n_2/2}\, \frac{\Gamma((n_1 + n_2)/2)}{\Gamma(n_1/2)\Gamma(n_2/2)}$$
Tables of the F distribution could be found in standard texts in statistics.
Uniform distribution. Random variable X has a uniform distribution on
the interval (a, b) if its density is given by the formula
$$f(x) = \begin{cases} \dfrac{1}{b-a} & \text{if } a < x < b; \\ 0 & \text{otherwise} \end{cases}$$
Its expectation and variance are equal to
$$EX = \frac{a+b}{2}, \qquad \operatorname{Var}(X) = \frac{(b-a)^2}{12}$$
In time series, uniform distribution may show up when we deal with periodical
functions. For instance, we may consider a series
Xt = sin(θt + U )
where t = 0, 1, 2, . . . is interpreted as time and U is random. If we want all possible
phases to be equally likely, a natural assumption would be that U is uniform on
the interval (0, 2π).
Cauchy distribution. Random variable X has Cauchy distribution if its
density is given by the formula
$$f(x) = \frac{1}{\pi}\,\frac{1}{1 + x^2}$$
This is an example of a distribution that has no expectation and variance. It
shows up on occasion as a result of transformation of other random variables.
For instance, if U is uniform on the interval (−π/2, π/2), then X = tan(U ) has
Cauchy distribution (Problem 19). If Z1 , Z2 are independent standard normals,
then X = Z1 /Z2 is also Cauchy (Problem 24) and so on.
A2.5. Multivariate normal distribution. It is the most important one
for the time series analysis. We say that X1 , . . . , Xn have a joint, or multivariate,
normal, a.k.a. Gaussian, distribution if there exist independent standard normal
random variables Z1 , . . . , Zm such that
$$(2.5)\qquad \begin{aligned} X_1 &= a_{11}Z_1 + \cdots + a_{1m}Z_m + m_1 \\ X_2 &= a_{21}Z_1 + \cdots + a_{2m}Z_m + m_2 \\ &\;\;\vdots \\ X_n &= a_{n1}Z_1 + \cdots + a_{nm}Z_m + m_n \end{aligned}$$
(such a representation is not unique). Multivariate normal distribution can be
uniquely characterized by its vector of the expectations
m = (m1 , . . . , mn ) = (EX1 , . . . , EXn )
and the covariance matrix C = (Cij ) where
Cij = Cov(Xi , Xj ).
If the representation (2.5) is given and A = (aij ), then the matrix C equals
$$C = AA^T.$$
$$E(Y|X_1, \dots, X_n) = d_0 + d_1 X_1 + \cdots + d_n X_n$$
In fact, this formula extends to a product of any even number of normal random
variables with zero expectations (the so-called Wick formula).
Example: bivariate normal distribution. It is a 2-dimensional distribution,
characterized by five parameters: the expectations $m_X$ and $m_Y$ of X
and Y, the variances $\sigma_X^2$ and $\sigma_Y^2$ of X and Y, and, finally, the correlation coefficient
$\rho = \rho_{XY}$ of X and Y. Then $\operatorname{Cov}(X, Y) = \rho\sigma_X\sigma_Y$,
$$C = \begin{pmatrix} \sigma_X^2 & \rho\sigma_X\sigma_Y \\ \rho\sigma_X\sigma_Y & \sigma_Y^2 \end{pmatrix}$$
and
$$B = \frac{1}{\sigma_X^2\sigma_Y^2(1 - \rho^2)} \begin{pmatrix} \sigma_Y^2 & -\rho\sigma_X\sigma_Y \\ -\rho\sigma_X\sigma_Y & \sigma_X^2 \end{pmatrix}$$
We say that the sequence Xn is a Cauchy sequence in mean squares if, for every
ε > 0, there exists N = N (ε) such that E(Xn − Xm )2 < ε whenever n, m ≥ N . It
could be shown that every sequence Xn that is Cauchy in mean squares, actually
converges in mean squares to some random variable X (it follows from the so called
completeness of L2 spaces, see advanced courses in Real Analysis).
A typical application of this result in time series analysis is as follows. Let Yn
be a sequence of random variables with zero expectation and finite variances, and
let
$$S_n = Y_1 + \cdots + Y_n$$
be a partial sum of the infinite series $\sum_{k=1}^{\infty} Y_k$. The sequence $S_n$ converges in mean
squares (and therefore the sum of the series could be defined) if it is Cauchy in
mean squares, that is if, for every ε > 0, there exists N such that
Var(Sn − Sm ) ≤ ε
whenever n, m ≥ N . In time series analysis, this property allows us to define a sum
of an infinite series made out of random variables if the partial sums converge in
mean squares.
Exercises
1. Show that P(∅) = 0. Hint: Use the axiom of countable additivity and set $A_k = \emptyset$
for all k ≥ 2.
2. Let $A_1, \dots, A_n$ be mutually exclusive events. Using the result of Problem
1, show that $P(\bigcup_{k=1}^{n} A_k) = \sum_{k=1}^{n} P(A_k)$.
3. Show that P (Ac ) = 1 − P (A) where Ac = Ω \ A is the complement of A.
4. Show that, if A ⊂ B, then P (A) ≤ P (B).
5. Show that P (A ∪ B) = P (A) + P (B) − P (A ∩ B)
6*. Let $A_n$ be a sequence of events such that $A_n \subset A_{n+1}$ for all n. Show that
$P(\bigcup_{n=1}^{\infty} A_n) = \lim_{n\to\infty} P(A_n)$.
7*. Let $A_n$ be a sequence of events such that $A_n \supset A_{n+1}$ for all n. Show that
$P(\bigcap_{n=1}^{\infty} A_n) = \lim_{n\to\infty} P(A_n)$.
8*. If P (B) > 0, show that P (AB|B) ≥ P (AB|A ∪ B) and P (A|A ∪ B) ≥
P (A|B).
9. If events $A_1, A_2, \dots, A_n$ are independent, then $A_1^c, A_2, \dots, A_n$ are also independent
(so we can replace any of the events on the list by its complement and
still have independence).
10. If $A_1, \dots, A_n$ are independent events, then
$$P(A_1 \cup \cdots \cup A_n) = 1 - \prod_{k=1}^{n} (1 - P(A_k)).$$
(b) Show that, if X and Y are independent normal random variables with param-
eters m1 , σ12 and m2 , σ22 , respectively, then X + Y is also normal with parameters
m1 + m2 and σ12 + σ22 (use part (a) and the properties of moment generating func-
tions).
34*. Let (X, Y ) have a bivariate normal distribution with the density given
by the formula (2.8).
(a) Assuming that the identity (2.3) (see Problem 25 above) is known, show
that the marginal distribution of X is $N(m_X, \sigma_X^2)$.
(b) Verify that ρ is the correlation coefficient of X and Y . Conclude from
here (and from part (a)) that, for a bivariate normal distribution, X and Y are
independent if and only if they are uncorrelated.
(c) Using the result in part (a), show that the conditional distribution of Y
given X = x is normal with parameters $m_Y + \rho\frac{\sigma_Y}{\sigma_X}(x - m_X)$ and $\sigma_Y^2(1 - \rho^2)$. In
particular, the conditional expectation
$$E(Y|X) = m_Y + \rho\frac{\sigma_Y}{\sigma_X}(X - m_X)$$
is linear, and the conditional variance $\operatorname{Var}(Y|X) = \sigma_Y^2(1 - \rho^2)$ is a constant (does
not depend on X).
(d) Using part (c), verify that the relation E(E(Y|X)) = EY, as well as (1.7),
holds in this case.
35. Show that $\Gamma(\tfrac{1}{2}) = \sqrt{\pi}$. Hint: Write down the integral (2.4) with $\alpha = \tfrac{1}{2}$
and try the substitution $u = \sqrt{2x}$. Compare the result with (2.3).
36. Show that if X and Y are identically distributed, not necessarily indepen-
dent, then
Cov(X + Y, X − Y ) = 0
37. Let X, Y be independent random variables, and let X be χ2 (n). Suppose
X + Y is χ2 (n + m). Show that Y is χ2 (m). [Hint: Express φX+Y in terms of φX
and φY and solve for φY . Compare the result with the moment generating function
of χ2 (m).]
38. Let X, Y be independent random variables, and let X be N (m1 , σ12 ).
Suppose X + Y is N (m1 + m2 , σ12 + σ22 ). Show that Y is N (m2 , σ22 ). [Hint: Express
φX+Y in terms of φX and φY and solve for φY . Compare the result with the
moment generating function of N (m2 , σ22 ).]
39*. Let X1 , . . . , Xn be i.i.d. random variables. Find
E(X1 |X1 + · · · + Xn = x)
40*. Let X1 , . . . , Xn be i.i.d. random variables that are strictly positive with
probability one. For all 0 ≤ k ≤ n, find
$$E\,\frac{X_1 + \cdots + X_k}{X_1 + \cdots + X_n}$$
41*. Let X be a continuous random variable with finite expectation. A number
m is called a median if P (X < m) = 1/2. Show that the median m solves the
minimization problem
E|X − m| → min .
APPENDIX B
Elements of Statistics
1. Basics
In probability, the starting point is a probabilistic model (e.g., independent random
variables with a given distribution), and the objective of the study is to establish certain
properties of the model. In contrast, in statistics, the starting point is a data set.
The probabilistic model for the data is not known, or may not even exist. The typical
objective of the study is to find a reasonable model, estimate its parameters, test
statistical hypotheses, predict future behavior, etc.
Most of the common statistical methods are not applicable to time series with-
out major modifications because they are designed for samples, that is for the data
collected in the situation when the same experiment is repeated independently. We
discuss some basic methods, concepts and results.
Parametric and non-parametric approach. To some extent, statistical
methods can be divided into parametric and non-parametric ones. For parametric
methods, we assume that the joint distribution of the data $X_1, \dots, X_n$ is known
to us up to a few unknown parameters. We therefore have to estimate those parameters,
test statistical hypotheses about their values, etc. The
following rule of thumb limits the complexity of the model: you should have at
least ten data points per parameter.
For non-parametric methods, we assume only that the distribution belongs to a
certain class, e.g., has certain smoothness, and so on. If our goal is to estimate the
distribution, then this approach requires bigger data sets; on the other hand, if
you have thousands of observations, then any parametric model would typically be
rejected (the real world can't be easily parameterized).
Looking ahead, some of the methods of time series analysis are non-parametric,
for instance spectrum estimation. Others are parametric, for instance ARMA mod-
els.
Here we focus on parametric methods.
B1.1. Example. Estimation of mean and variance. We begin with
the following problem. Suppose X1 , . . . , Xn are independent identically distributed
random variables with unknown expectation µ = EXi and unknown variance σ 2 =
Var Xi . We want to estimate µ and σ 2 , that is to construct functions µ̂(X1 , . . . , Xn )
and σ̂ 2 (X1 , . . . , Xn ) (estimates) that are ‘close’ to µ and σ 2 and therefore serve as
an ‘educated guess’ about the actual values of the parameters.
For instance, since $EX_i = \mu$, we may take the sample mean
$$\bar X = \frac{X_1 + \cdots + X_n}{n}$$
as an estimate for µ. Clearly, E X̄ = µ. In addition, the law of large numbers
implies that X̄ converges to µ in probability. For those reasons, X̄ is a natural
estimate for the expectation. In order to estimate the variance, we may consider
$$\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar X)^2$$
Direct computation shows that
$$(1.1)\qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} X_i^2 - (\bar X)^2$$
Hence, by the law of large numbers applied to $X_i$ and $X_i^2$, $\hat\sigma^2$ converges in probability
to $EX^2 - (EX)^2 = \sigma^2$. Next, $EX_i^2 = \sigma^2 + \mu^2$ and $E\bar X^2 = \operatorname{Var}\bar X + (E\bar X)^2 = \sigma^2/n + \mu^2$. Therefore
$$E\hat\sigma^2 = \sigma^2 + \mu^2 - \sigma^2/n - \mu^2 = \frac{n-1}{n}\,\sigma^2.$$
For this reason, another statistic is often considered,
$$(1.2)\qquad s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar X)^2 = \frac{n}{n-1}\,\hat\sigma^2$$
as $n \to \infty$, and therefore
$$P\{|\hat\theta - \theta| > \varepsilon\} \le \frac{1}{\varepsilon^2}\, E(\hat\theta - \theta)^2 \to 0$$
as well.
B1.3. Cramér-Rao Inequality. Fisher information. Suppose X1 , . . . , Xn
are i.i.d. with density f (x, θ), and let θ̂ be an unbiased estimate for parameter θ.
In addition, assume that the density f (x, θ) is twice differentiable with respect to
θ. Also, we assume that the set {x : f (x, θ) > 0} does not depend on θ. Then, for
every unbiased estimate $\hat\theta$,
$$(1.3)\qquad \operatorname{Var}(\hat\theta) \ge \frac{1}{nI(\theta)}$$
where
$$(1.4)\qquad I(\theta) = E\Bigl(\frac{\partial \log f(X_i, \theta)}{\partial\theta}\Bigr)^2$$
is the Fisher information. The inequality (1.3) is known as the Cramér-Rao inequal-
ity. It therefore gives a lower bound for the variance of every unbiased estimate. A
sketch of the proof is given below in the ‘Details’ subsection.
If we have even more smoothness, then the formula (1.4) can be further reduced
to
$$(1.5)\qquad I(\theta) = -E\Bigl(\frac{\partial^2 \log f(X_i, \theta)}{\partial\theta^2}\Bigr)$$
In some cases, the expression (1.5) is more convenient.
If there are not one but several unknown parameters $\theta_1, \dots, \theta_k$, then there exists
a multivariate version of the Cramér-Rao inequality, which gives a lower bound for the
variance-covariance matrix of $\hat\theta_1, \dots, \hat\theta_k$, provided those estimates are unbiased.
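For the N(µ, 1) family, both expressions give I(µ) = 1: the score is ∂ log f/∂µ = x − µ, so (1.4) is E(X − µ)² = 1, while the second derivative of log f in µ is identically −1, so (1.5) gives 1 as well. The quadrature check below is a toy verification of (1.4).

```python
import math

def normal_density(x, mu):
    """Density of N(mu, 1)."""
    return math.exp(-(x - mu) ** 2 / 2.0) / math.sqrt(2.0 * math.pi)

def expect(g, mu, lo=-10.0, hi=10.0, n=20000):
    """E g(X) for X ~ N(mu, 1), by midpoint-rule quadrature."""
    dx = (hi - lo) / n
    return sum(
        g(lo + (i + 0.5) * dx) * normal_density(lo + (i + 0.5) * dx, mu)
        for i in range(n)
    ) * dx

mu = 0.7
info_score = expect(lambda x: (x - mu) ** 2, mu)   # (1.4): E (dlogf/dmu)^2
info_curvature = 1.0                               # (1.5): -E(-1) = 1, exactly
```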
B1.4. Efficiency. Since the Cramér-Rao inequality gives a lower bound for
the variance of an estimate, it allows us to define an efficiency of an (unbiased)
estimate. Let $\hat\theta$ be an unbiased estimate of the parameter θ. We set
$$e(\hat\theta) = \frac{(nI(\theta))^{-1}}{E(\hat\theta - \theta)^2}$$
where I(θ) is the Fisher information. By (1.3), $0 \le e(\hat\theta) \le 1$ for unbiased estimates.
The efficiency (as well as the estimate itself) depends on the number of observations.
If the variance of $\hat\theta$ is exactly equal to $1/(nI(\theta))$, and therefore $e(\hat\theta) = 1$, then
$\hat\theta$ is called efficient, or a minimal variance unbiased estimate. Whenever minimal
variance unbiased estimators exist, they can be found by the method of maximum
likelihood.
If $e(\hat\theta) \to 1$ as $n \to \infty$, then the estimate is called asymptotically efficient.
For instance, suppose that Xi are i.i.d. normal with parameters µ, σ 2 and only
µ is unknown. As we have already seen, X̄ is a natural estimate for µ, it is unbiased
and consistent. Moreover, it could be shown that,
E(X̄ − µ)2 = 1/(nI(µ))
and therefore the efficiency of X̄ equals one. If Xi are normal but σ² is not known, then X̄ is only asymptotically efficient; if Xi are not normal, X̄ need not be efficient at all. For instance, if Xi have the so-called double exponential distribution with the density f(x, µ, θ) = (1/(2θ)) exp{−|x − µ|/θ}, a.k.a. the Laplace distribution, then the efficiency of X̄ is only 1/2; the best estimate here is the sample median.
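A small simulation sketch of this comparison (the sample size and scale below are illustrative assumptions): for Laplace data, the sample median comes out roughly twice as efficient as the sample mean.

```python
import math
import random
import statistics

# Compare the sample mean and the sample median as estimates of the
# location mu of a Laplace distribution with scale theta.
random.seed(1)
mu, theta, n, trials = 0.0, 1.0, 200, 5000

def laplace(mu, theta):
    # inverse-CDF sampling: u uniform on (-1/2, 1/2)
    u = random.random() - 0.5
    return mu - theta * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

means, medians = [], []
for _ in range(trials):
    xs = [laplace(mu, theta) for _ in range(n)]
    means.append(statistics.fmean(xs))
    medians.append(statistics.median(xs))

ratio = statistics.pvariance(means) / statistics.pvariance(medians)
print(ratio)  # roughly 2: Var(mean) ~ 2*theta^2/n, Var(median) ~ theta^2/n
```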
B1.5. Asymptotic Normality. The last ‘good’ property we’d like to have is called asymptotic normality. We say that an estimate θ̂ is asymptotically normal if
θ̂ − θ ≈ N(0, σn²)
as n → ∞. This property allows us to test statistical hypotheses about the values of the parameter, provided σn² is known to us or can be estimated. For instance, let X̄ be the sample mean. The Central Limit Theorem implies that X̄ ≈ N(θ, σ²/n), so the sample mean is asymptotically normal, and its variance is either known or can be estimated. As for the sample variance, the χ² distribution provides a better approximation (though it is asymptotically normal as well).
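For illustration (a simulation sketch; exponential data and the sample size are assumed choices), the standardized sample mean √n(X̄ − θ)/σ behaves approximately like N(0, 1), so it falls in [−1.96, 1.96] about 95% of the time:

```python
import math
import random
import statistics

# Asymptotic normality of the sample mean for Exp(1) data: the mean theta
# and the standard deviation sigma of Exp(1) both equal 1.
random.seed(2)
theta = 1.0
n, trials = 400, 4000

inside = 0
for _ in range(trials):
    xs = [random.expovariate(1.0) for _ in range(n)]
    z = math.sqrt(n) * (statistics.fmean(xs) - theta) / theta
    if -1.96 <= z <= 1.96:
        inside += 1

print(inside / trials)  # close to 0.95
```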
There exists a number of methods of constructing ‘good’ estimates; we’ll discuss
only one.
B1.6. Maximum Likelihood estimates. Suppose Xi are i.i.d. random
variables with density f (x; θ) where θ stands for the unknown parameter (one or
several). Then the joint density of X1 , . . . , Xn is equal to the product
f (x1 , . . . , xn ; θ) = f (x1 ; θ) . . . f (xn ; θ)
Let us plug in the data X1, . . . , Xn instead of x1, . . . , xn:
L(θ) = f (X1 , . . . , Xn ; θ)
The resulting function is called the likelihood function; it is a function of the unknown parameter(s) only. In order to find the maximum likelihood estimate of θ, we maximize the likelihood function L. The point θ̂ at which L achieves its maximum is called the maximum likelihood estimate for θ.
Since the joint density f(x1, . . . , xn; θ) of X1, . . . , Xn is a product of the marginal densities, it is more convenient to deal with its logarithm. The function
log L(θ)
is called the log likelihood function. Clearly, L and log L achieve the maximum at
the same point, but log L is, typically, easier to work with.
The method of maximum likelihood is a formalization of the following idea. If Xi are i.i.d. with the density f(x), then most of the observations Xi should fall in the region where the density is large. So we choose the values of the unknown parameters for which most of the values f(Xi; θ) are large and none are very small.
Under some regularity conditions imposed on the density f(x; θ) (similar to those required for the Cramér-Rao inequality), the maximum likelihood estimates are consistent, asymptotically unbiased, asymptotically normal, and asymptotically efficient. Computations that support this statement can be found at the end of the section.
Example. Let X1 , . . . , Xn be a sample of size n from a normal distribution
with unknown expectation µ and known variance σ 2 . Then the likelihood function
equals
L(µ) = (1/((2π)^{n/2} σⁿ)) exp{−Σ(Xi − µ)²/(2σ²)}.
Since everything except µ is a constant, L reaches its maximum when Σ(Xi − µ)² is the smallest. Equivalently, we could consider the log likelihood function
log L(µ) = −(n/2) log(2π) − n log σ − Σ(Xi − µ)²/(2σ²)
and set the partial derivative with respect to µ to zero. Either way, we get the equation
Σ(Xi − µ) = 0
which yields µ̂ = X̄. As we already know, this estimate is unbiased and consistent.
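As a sketch of the general recipe (the simulated data and the value of σ are assumptions for the illustration), one can also maximize the log likelihood numerically and check that the maximizer agrees with X̄:

```python
import math
import random
import statistics

# Maximize the normal log likelihood in mu by a simple grid search and
# compare the maximizer with the sample mean X-bar.
random.seed(3)
sigma = 2.0
xs = [random.gauss(5.0, sigma) for _ in range(500)]

def log_likelihood(mu):
    n = len(xs)
    return (-0.5 * n * math.log(2 * math.pi) - n * math.log(sigma)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma**2))

grid = [4.0 + k * 0.001 for k in range(2001)]   # search mu over [4, 6]
mu_hat = max(grid, key=log_likelihood)

print(mu_hat, statistics.fmean(xs))  # agree up to the grid step 0.001
```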
Remark. In fact, the method of maximum likelihood does not require the
observations to be i.i.d. Whenever the joint density of X1 , . . . , Xn is known up
to some unknown parameters, the same procedure can be applied. Properties of
the estimates have been studied in some, but not all, cases that are not i.i.d. No exceptions are known to the following rule of thumb: whenever consistent estimates exist at all, maximum likelihood estimates are consistent and asymptotically efficient. Usually, they are also asymptotically normal.
B1.7. Confidence intervals. Instead of constructing an estimate µ̂ for the
unknown parameter µ, we may prefer to construct an interval [µ̂1 , µ̂2 ] such that it
contains the actual value of the parameter with large probability. Such an interval
is called a confidence interval for the parameter µ. The number
α = P{µ̂1 ≤ µ ≤ µ̂2}
is called the confidence level of the interval. For instance, we say that it is a 95% confidence interval if
P{µ̂1 ≤ µ ≤ µ̂2} = 0.95
Confidence intervals are preferable to point estimates when the sample size is small. They can sometimes be constructed from point estimates if we know the distribution of the estimate.
Example. Suppose X1 , . . . , Xn are i.i.d. N (µ, 1) and
µ̂ = X̄ = (X1 + · · · + Xn )/n
Then µ̂ − µ is also normal, with expectation 0 and variance 1/n. Therefore
Z = √n (µ̂ − µ)
is standard normal, and therefore
P{−1.96 ≤ Z ≤ 1.96} = 0.95
(we check the table of the standard normal distribution for that). Solving for µ, we see that
{−1.96 ≤ Z ≤ 1.96} = {µ̂ − 1.96/√n ≤ µ ≤ µ̂ + 1.96/√n}.
So,
[µ̂ − 1.96/√n, µ̂ + 1.96/√n]
is a 95% confidence interval for µ.
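A simulation sketch of this interval (the values µ = 3 and n = 25 are illustrative assumptions): in repeated samples it should cover µ about 95% of the time.

```python
import math
import random
import statistics

# Coverage check for [mean - 1.96/sqrt(n), mean + 1.96/sqrt(n)]
# when X1, ..., Xn are i.i.d. N(mu, 1).
random.seed(4)
mu, n, trials = 3.0, 25, 10_000

covered = 0
for _ in range(trials):
    xs = [random.gauss(mu, 1.0) for _ in range(n)]
    m = statistics.fmean(xs)
    half = 1.96 / math.sqrt(n)
    if m - half <= mu <= m + half:
        covered += 1

print(covered / trials)  # close to 0.95
```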
To justify that, we need some mild conditions that would allow us to swap the
integral and the partial derivative. In a similar way, we can show that
(1.7)    E[(1/f(Xi; θ)) ∂²f(Xi; θ)/∂θ²] = ∂²/∂θ² ∫_{−∞}^{∞} f(x; θ) dx = 0.
The equation (1.6) is equivalent to the identity
E[∂ log f(Xi; θ)/∂θ] = 0
In turn, one can show that (1.7) implies
E[∂² log f(Xi; θ)/∂θ²] = −I(θ).
Indeed,
∂² log f(Xi; θ)/∂θ² = (1/f(Xi; θ)) ∂²f(Xi; θ)/∂θ² − ((∂f(Xi; θ)/∂θ)/f(Xi; θ))²
by the chain rule. The expectation of the first term equals zero by (1.7), and the second term is equal to
−(∂ log f(Xi; θ)/∂θ)²
so its expectation coincides with −I(θ) by (1.4).
We now move to the proof of the Cramér-Rao inequality. It is based on the Schwarz inequality (A1.3). Denote
B = Σ_{i=1}^{n} ∂ log f(Xi; θ)/∂θ.
By (1.6),
E(B) = 0.
Next, B is a sum of independent identically distributed random variables with zero
expectation and therefore
Var(B) = Σ_{i=1}^{n} Var(∂ log f(Xi; θ)/∂θ) = n E[(∂ log f(X1; θ)/∂θ)²] = nI(θ)
(we plugged in the expression for log L and divided by n). Now, let us expand the function ∂ log f(x; θ)/∂θ around θ0. We have
∂ log f(x; θ)/∂θ = ∂ log f(x; θ)/∂θ |_{θ=θ0} + (θ − θ0) ∂² log f(x; θ)/∂θ² |_{θ=θ0} + O((θ − θ0)²)
Substituting this expression into (1.9) and denoting
B0 = (1/n) Σ_{i=1}^{n} ∂ log f(Xi; θ)/∂θ |_{θ=θ0},
B1 = (1/n) Σ_{i=1}^{n} ∂² log f(Xi; θ)/∂θ² |_{θ=θ0}
(see (1.7) above). Since B0 goes to zero and B1 does not, there exists a solution θ̂
to (1.10) that is close to θ0 and we can conclude from here that
θ̂ − θ0 ≈ −B0/B1
Next, since B0 → 0 and B1 converges to −I(θ0), one can conclude that E θ̂ → θ0, so θ̂ is asymptotically unbiased. Next, the variance of θ̂ approximately equals
Var θ̂ ≈ (1/I²(θ0)) Var B0 ≈ (1/I²(θ0)) (1/n) Var(∂ log f(Xi; θ)/∂θ |_{θ=θ0}) = 1/(nI(θ0))
So,
E(θ̂ − θ0)² ≈ 1/(nI(θ0))
and therefore
â = (Σ XiYi/n − X̄ · Ȳ)/(Σ Xi²/n − X̄²),    b̂ = Ȳ − â X̄
B2.2. Probabilistic Motivation (Normal Linear Regression Model).
Suppose random variables Yi are independent and normally distributed with ex-
pectations aXi + b and with the same variance σ 2 where a, b and σ 2 are unknown
parameters and Xi are known (non-random) numbers. Let us construct the max-
imum likelihood estimates for the parameters a, b, σ 2 . The density of the random
variable Yi is given by the formula
fYi(y) = (1/(√(2π) σ)) exp{−(y − aXi − b)²/(2σ²)}
Since Yi are independent, the joint density of Y1 , . . . , Yn is equal to the product of
the marginal ones:
fY1,...,Yn(y1, . . . , yn; a, b, σ²) = (1/((2π)^{n/2} σⁿ)) exp{−Σ_{i=1}^{n} (yi − aXi − b)²/(2σ²)}
Therefore the log likelihood function is equal to
log L(a, b, σ²) = log f(Y1, . . . , Yn) = −(n/2) log(2π) − n log σ − Σ_{i=1}^{n} (Yi − aXi − b)²/(2σ²)
= −(n/2) log(2π) − n log σ − Q(a, b)/(2σ²)
where Q is given by (2.1). For every σ, the maximal value of log L corresponds to
the minimal value of Q and therefore the maximum likelihood estimates of a and
b coincide with the least squares estimates found above. However, we are able to
estimate σ 2 as well. Namely, let â, b̂ be the least squares estimates. Differentiating
with respect to σ, we get the equation
∂ log L/∂σ = −n/σ + Q(â, b̂)/σ³ = 0
which implies
σ̂² = Q(â, b̂)/n.
Note that this estimate is biased (but asymptotically unbiased, consistent and
asymptotically efficient). In order to get an unbiased estimate, we should divide by
n − 2 instead:
sₑ² = Q(â, b̂)/(n − 2)
The estimate â also admits a probabilistic interpretation. Since Σ Xi²/n − X̄² and Σ XiYi/n − X̄ · Ȳ are the estimates for the variance σX² of X and the covariance CXY of X and Y, â can be written as
â = ĈXY/σ̂X²
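A sketch of the formulas of this subsection on simulated data (the true values a = 2, b = 1, σ = 0.5 and the design points Xi are assumptions for the illustration):

```python
import random
import statistics

# Least squares / maximum likelihood estimates for the normal linear
# regression model Y_i = a*X_i + b + noise.
random.seed(5)
a_true, b_true, sigma = 2.0, 1.0, 0.5
X = [i / 10 for i in range(100)]
Y = [a_true * x + b_true + random.gauss(0.0, sigma) for x in X]

x_bar = statistics.fmean(X)
y_bar = statistics.fmean(Y)
xy_bar = statistics.fmean(x * y for x, y in zip(X, Y))
x2_bar = statistics.fmean(x * x for x in X)

a_hat = (xy_bar - x_bar * y_bar) / (x2_bar - x_bar**2)
b_hat = y_bar - a_hat * x_bar

# residual sum of squares Q(a_hat, b_hat) and the two variance estimates
n = len(X)
Q = sum((y - a_hat * x - b_hat) ** 2 for x, y in zip(X, Y))
sigma2_mle = Q / n          # biased maximum likelihood estimate
s2_e = Q / (n - 2)          # unbiased estimate

print(a_hat, b_hat, sigma2_mle, s2_e)
```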
B2.3. Multiple linear regression. If Y depends on several factors X1 , . . . Xk ,
we arrive at multiple linear regression. For instance, assume that random variables
Yi are independent and normally distributed with expectations a0 +a1 Xi1 +a2 Xi2 +
. . . ak Xik and the same variance σ 2 . Here Xi1 , . . . , Xik are certain external factors
for the observation i (their values may change when we move to the next data
Exercises
1. Let X1 , . . . , Xn be a sample of size n from a distribution with expectation µ
and variance σ 2 and let µ̂ = (2X1 + X2 + · · · + Xn−1 + 2Xn )/(n + 1) be an estimator
for µ. Is it unbiased? asymptotically unbiased? consistent?
2. Let X1 , . . . , Xn be a sample of size n from a normal distribution with
parameters µ, σ 2 . Assuming σ 2 is known, show that the sample mean X̄ is a
minimal variance unbiased estimate for µ. To this end, compute the corresponding
Fisher information and compare the variance of the sample mean with the Cramér-
Rao bound.
3. Let X1 , . . . , Xn be a sample of size n from a normal distribution with
parameters µ, σ 2 . Assuming µ is known, construct a maximum likelihood estimate
for σ 2 . Is it unbiased? Asymptotically unbiased? Consistent?
4. Let X1 , . . . , Xn be a sample of size n from a normal distribution with
parameters µ, σ 2 . Construct maximum likelihood estimates for µ and σ 2 . Are they
unbiased? Asymptotically unbiased? Consistent?
5. Let X1 , . . . , Xn be a sample of size n from a Gamma distribution with
parameters α and θ (see Section A2.2). Assuming α is known, construct a maximum
likelihood estimate for θ.
6. Let X1 , . . . , Xn be a sample of size n from a distribution with the density
f(x, α, β) = αβx^{β−1} e^{−αx^β} if x > 0, and 0 otherwise
where α > 0 and β > 0 (the so-called Weibull distribution). Assuming β is known, find
a maximum likelihood estimate for α.
7. Let X1 , . . . , Xn be a sample of size n from a distribution with the density
f(x, α) = α/x^{α+1} if x > 1, and 0 otherwise
where α > 0 (the so-called Pareto distribution). Find a maximum likelihood estimate
for α.
8. Let X1, . . . , Xn be a sample from a normal distribution with parameters mX, σ² and let Y1, . . . , Yn be another sample from a normal distribution with parameters mY, σ² (the variance is the same for both samples, and the samples are independent of each other). Find the maximum likelihood estimates for mX, mY and σ².
9*. Let X1 , . . . , Xn be a sample of size n from a distribution with the density
f(x, µ, θ) = (1/(2θ)) exp{−|x − µ|/θ}
where θ > 0 (Laplace distribution). Find maximum likelihood estimates for µ and
θ.
10. For a normal linear regression model, verify that â and b̂ are unbiased
estimates for a and b.
11. For a multiple linear regression model, verify that Q(Â) = Y′Y − Â′X′Y.
APPENDIX C
1. Basics
A complex number is an expression of the form a + ib where a, b are real num-
bers and i is the imaginary unit (an element with the property i2 = −1). Basic
operations with complex numbers are defined as follows:
(a1 + ib1 ) + (a2 + ib2 ) = (a1 + a2 ) + i(b1 + b2 )
(a1 + ib1 ) − (a2 + ib2 ) = (a1 − a2 ) + i(b1 − b2 )
(a1 + ib1 )(a2 + ib2 ) = (a1 a2 − b1 b2 ) + i(a1 b2 + a2 b1 )
We interpret a complex number z = a + bi as a point on the plane with rectangular
coordinates (a, b) (and as a corresponding vector as well). If z = a + bi, then
a = Re z is called the real part of z and b = Im z is the imaginary part of z.
Complex numbers of the form a = a + i0 are just real numbers. For this reason,
the horizontal axis is called the real axis. Numbers ib = 0 + ib are called purely
imaginary (and the vertical axis is called the imaginary axis).
The distance |z| = √(a² + b²) from the origin to (a, b) is called the absolute value of z. In particular, one can check that |z1 z2| = |z1||z2|.
For a complex number z = a + ib, a number z̄ = a − ib is called a complex
conjugate for z. Geometrically, z̄ is a reflection of z in the real axis. It could be
easily seen that
z + w = z̄ + w̄
zw = z̄ w̄
z z̄ = |z|2
The last relation allows us to compute a reciprocal
z⁻¹ = z̄/|z|²
It also helps us to divide complex numbers:
z/w = (z w̄)/(w w̄) = z w̄/|w|²
Complex exponents. For real numbers x, y, we set
ex+iy = ex (cos y + i sin y)
There is a number of reasons to do so. One of them is a power series representation
e^z = 1 + z + z²/2! + z³/3! + · · ·
Also,
ez1 +z2 = ez1 ez2
With help of complex exponents, an arbitrary complex number can be represented
as
(1.1) z = reiθ
where r = |z| is the absolute value of z and θ, called the argument of z, is defined
up to a multiple of 2π. The representation (1.1) is called the exponential form of a
complex number.
In terms of complex exponents, we have
cos x = (e^{ix} + e^{−ix})/2,    sin x = (e^{ix} − e^{−ix})/(2i)
for every real number x.
Geometrically, the condition |z| = 1 defines a unit circle. Its elements can be
represented in the form
z = cos θ + i sin θ = eiθ
where θ is the angle between the real axis and the vector z. Elements eiθ play a
special role. In particular,
e^{iω} e^{iθ} = e^{i(ω+θ)},    e^{iω}/e^{iθ} = e^{i(ω−θ)},    e^{i2πn} = 1,
and the complex conjugate of e^{iω} is e^{−iω}.
One of the fundamental results of the theory is as follows: every polynomial of order n has n roots (counting multiplicities). In particular, every polynomial can be uniquely factorized into a product of linear factors:
p(z) = an zⁿ + · · · + a0 = an(z − z1) . . . (z − zn)
if an ≠ 0. If the coefficients of the polynomial are real numbers, then p(z̄) is the complex conjugate of p(z), and therefore if p(z) = 0, then p(z̄) = 0. That is, if a root of p(z) is not a real number, then its conjugate is also a root.
Exponential form and polar coordinates. If reiθ is the exponential form
of a complex number z = x + iy, then r and θ are polar coordinates of the point
with rectangular coordinates (x, y). Vice versa, if (r, θ) are the polar coordinates
of the point (x, y) and if r > 0, then r = |x + iy| is the absolute value of z = x + iy
and θ is the argument of z. The only difference between polar coordinates and
exponential representation of a complex number is that, in polar coordinates, r is
allowed to be negative.
Roots of unity. That is a name for the solutions of the equation
zn = 1
Suppose z = re^{iθ}. The equation becomes zⁿ = rⁿe^{inθ} = 1, and therefore r = 1 and nθ = 2πk, so θ = 2πk/n. So, all the solutions to the equation have the form
wk = e^{i2πk/n},    k = 0, 1, 2, . . . , n − 1.
In a similar way, we can solve an equation zⁿ = z0. Namely, if z0 = r0 e^{iθ0}, then all the solutions to zⁿ = z0 are given by the formula
z = r0^{1/n} e^{iθ0/n} wk,    k = 0, 1, 2, . . . , n − 1
where wk are the roots of unity.
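A short computational sketch (n = 6 and z0 = 2 − 2i are arbitrary choices) of the roots of unity and of the general equation zⁿ = z0:

```python
import cmath

# n-th roots of unity w_k = exp(2*pi*i*k/n); each solves z**n = 1 and,
# for n > 1, they sum to zero.
n = 6
roots = [cmath.exp(2j * cmath.pi * k / n) for k in range(n)]
assert all(abs(w**n - 1) < 1e-9 for w in roots)
assert abs(sum(roots)) < 1e-9

# general equation z**n = z0 with z0 = r0 * exp(i*theta0)
z0 = 2 - 2j
r0, theta0 = abs(z0), cmath.phase(z0)
sols = [r0 ** (1 / n) * cmath.exp(1j * theta0 / n) * w for w in roots]
assert all(abs(s**n - z0) < 1e-9 for s in sols)
print(sols[0])
```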
Convergence of complex numbers. Convergence of a series. We say
that zn → z as n → ∞ if |zn − z| → 0. Since |Re z| ≤ |z| ≤ |Re z| + |Im z| and the
Exercises
1. Simplify
(1 + i)/(1 − i),    (1 + i)(1 − 2i)
2. Convert the following numbers to the exponential form:
−1 + i,    2i,    1 + i√3,    −0.5
3. Convert the following numbers to the usual (that is, a + bi) form:
e^{iπ},    (1/√2) e^{i3π/4},    e^{−iπ},    √2 e^{−iπ/4}
4. Which of the following numbers are inside the unit circle? Which are
outside? Which are on the unit circle?
1 + i,    −i,    (1/√2) e^{i3π/4},    0.6 + 0.8i,    e^{0.3i}
5. With the help of trig identities, verify that, for real numbers x, y,
eix eiy = ei(x+y)
6. By multiplying and dividing by (1 − z), verify that
1 + z + z² + · · · + zⁿ = (1 − z^{n+1})/(1 − z)
if z ≠ 1.
7. Let θ = 2πk/n where k, n > 0 are some integers. Show that
eθi + e2θi + · · · + enθi = 0
if k is not a multiple of n, and
eθi + e2θi + · · · + enθi = n
otherwise. By taking real and imaginary parts, conclude that
cos θ + cos 2θ + · · · + cos nθ = 0,
sin θ + sin 2θ + · · · + sin nθ = 0
if k is not a multiple of n, and
cos θ + cos 2θ + · · · + cos nθ = n,
sin θ + sin 2θ + · · · + sin nθ = 0
otherwise.
9. Let
p(z) = an zⁿ + an−1 z^{n−1} + · · · + a1 z + a0
and let
q(z) = an + an−1 z + · · · + a1 z^{n−1} + a0 zⁿ
Show that, if p(z) = 0, then q(z −1 ) = 0 and vice versa. If
p(z) = an (z − z1 ) . . . (z − zn )
(hence z1 , . . . , zn are the roots of p(z)), then
q(z) = an (1 − zz1 ) . . . (1 − zzn )
2. Power Series
A series of the form
(2.1)    f(z) = c0 + c1 z + c2 z² + · · ·
is called a power series (to be precise, it is a power series about the origin, but those are the only ones that we need). Important facts about power series:
1. There exists a number R, 0 ≤ R ≤ ∞, such that the power series (2.1)
converges absolutely for all z such that |z| < R, and diverges for all z such that
|z| > R. The number R is called the radius of convergence. Note that this property
does not say anything about those z with |z| = R.
them out and cancel). Let now R > 0 be the smallest of the absolute values of the
roots of the denominator q(z). Then f (z) can be represented as a sum of a power
series, with the radius of convergence that is exactly equal to R. To show that, we
should factor q(z) into the product of linear factors q(z) = bm (z − z1 ) . . . (z − zn )
and then represent f(z) as a linear combination of simple fractions 1/(z − zi)^k (partial fractions).
Exercises
For each of the following functions, find its representation as a sum of a power
series. Find the corresponding radius of convergence.
1. f(z) = (1 + z)/(1 − 2z)
2. f(z) = 1/(1 − 2z)²
3. f(z) = (1 + z)/(2 − 3z + z²)
4. f(z) = 1/(1 + z + z²)
5. f(z) = 1/(1 − z + z² − z³)
6. f(z) = 1/(1 − 2√3 z + 4z²)
3. Difference Equations
Difference equations play an important role in time series analysis when we
study the autocorrelation function of a stationary series.
We say that a sequence xn , n = 0, 1, 2, . . . satisfies a difference equation of the
order k, if, for all n ≥ k,
(3.1) xn + a1 xn−1 + · · · + ak xn−k = 0
where a1, . . . , ak are some (known) coefficients. We will assume that all the coefficients are real numbers and that the leading coefficient ak ≠ 0. The relation (3.1) also
goes under the name ‘homogeneous linear recurrence relation with constant coefficients’. The equation (3.1) defines the sequence recursively; we need to know the first k values (the so-called ‘seed values’). Nonetheless, we’d like to have a closed-form formula for a general solution.
We begin with the associated characteristic polynomial
(3.2) p(z) = z k + a1 z k−1 + · · · + ak
It has k zeroes, some of them could be complex numbers. However, if p(z) = 0,
then p(z̄) = 0 as well. Hence, we have a collection of real roots, possibly with mul-
tiplicities, and a collection of complex roots that come in pairs (z and its conjugate
z̄), also with multiplicities.
A general solution to (3.1) could be written in terms of the roots of (3.2). Let
us suppose first that all the roots z1 , . . . , zk are real and distinct. Then a general
solution to the equation (3.1) is given by the formula
(3.3)    xn = Σ_{j=1}^{k} Cj zjⁿ
So, in this case, to each real root z, there corresponds a solution z n ; to each pair
of conjugate complex roots z = r(cos(θ) ± i sin(θ)), there correspond two solutions
that are equal to rn cos(nθ) and rn sin(nθ) or, which is the same, to the real and
the imaginary parts of the sequence z n .
The situation is more complicated if not all of the roots are distinct. Suppose the
equation (3.2) has d distinct real roots z1 , . . . , zd with multiplicities mj , so that
m1 + · · · + md = k. The general solution to the equation (3.1) is given by the
formula
(3.5)    xn = Σ_{j=1}^{d} Σ_{r=0}^{mj−1} Cj,r n^r zjⁿ
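A computational sketch of formula (3.3) (the recurrence xn = 5xn−1 − 6xn−2 and its seed values are an illustrative example, not from the text):

```python
# The characteristic polynomial of x_n = 5 x_{n-1} - 6 x_{n-2} is
# z^2 - 5z + 6 = (z - 2)(z - 3), with distinct real roots 2 and 3, so by
# (3.3) x_n = C1*2^n + C2*3^n; the seeds x_0 = 2, x_1 = 5 give C1 = C2 = 1.

def x_closed(n):
    return 2**n + 3**n

# direct recursion from the seed values, for comparison
xs = [2, 5]
for n in range(2, 20):
    xs.append(5 * xs[-1] - 6 * xs[-2])

assert all(x_closed(n) == xs[n] for n in range(20))
print(xs[5], x_closed(5))  # 275 275
```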
Exercises
1. (Fibonacci numbers) Let x1 = x2 = 1 and let xn = xn−1 + xn−2 , n ≥ 3.
Find a closed-form representation for xn .
2. Let x1 = x2 = 1 and let
xn = xn−1 + 0.25xn−2 , n ≥ 3.
Find a closed-form representation for xn .
3. Let again x1 = x2 = 1 and let
xn = 2√3 xn−1 + 4xn−2,    n ≥ 3.
Find a closed-form representation for xn .
4. Let now x1 = 0, x2 = x3 = 1 and let
xn = xn−1 − xn−2 + xn−3 , n ≥ 4.
(a) Find a closed-form representation for xn . (b) Do the same with x1 = x2 = x3 =
1.
5. Let again x1 = x2 = x3 = 1 and let
xn = 6xn−1 − 12xn−2 + 8xn−3 , n ≥ 4.
Find a closed-form representation for xn .
6*. Let x1 = x2 = 1, x3 = x4 = 0 and let
xn = 4√3 xn−1 − 20xn−2 + 16√3 xn−3 − 16xn−4,    n ≥ 5.
Find a closed-form representation for xn .