
Math 4540/5540

Introduction to Time Series Analysis

S. E. Kuznetsov
Department of Mathematics, University of Colorado,
Boulder
E-mail address: Sergei.Kuznetsov@Colorado.edu

(c) Sergei Kuznetsov, 2009-2017


Contents

Introduction

Chapter 1. Trend-based Models
1. Parametric Models
2. Moving Averages
3. Exponential Smoothing
4. Stationarity Tests

Chapter 2. Stationary Models
1. Stochastic Processes and Stationarity
2. Moving Average Processes
3. Autoregression of the First Order
4. Back Shift Operator
5. Second Order Autoregression
6. General Autoregression Model
7. Moving Average Processes Revisited. Invertibility Condition
8. General ARMA Processes
9. Non-Stationary Processes and ARIMA (Box-Jenkins) Models
10. Prediction in a Stationary Model

Chapter 3. Stationary Models: Estimation
1. Time Series and Stochastic Processes. Estimation of Mean and Variance of a Time Series
2. Estimation of ACV, ACF and PACF
3. Properties of the Sample ACF and Sample PACF
4. Estimation of the Parameters of the Autoregression Model
5. Estimation of Parameters of Moving Average and ARMA Models
6. Distribution of the Estimates and Confidence Regions*
7. Comparison of Models

Chapter 4. Spectrum and its Estimation
1. Spectral Density of a Stationary Process. Examples
2. Periodogram and its Properties
3. Estimation of a Spectrum. Properties of the Estimates
4. Estimation Details

Chapter 5. Filters
1. Classification of Filters
2. Construction of the Filters
3. Filters and Phase Shift

Chapter 6. Seasonality
1. Models and Methods
2. Examples

Chapter 7. Multivariate Models
1. Cross-covariance, Cross-correlation and Cross-spectrum
2. State-Space Models. Kalman Filter

Appendix A. Elements of Probability
1. Basic Concepts
2. Important Probability Distributions
3. Laws of Large Numbers and Central Limit Theorem
4. Convergence in Mean Squares

Appendix B. Elements of Statistics
1. Basics
2. Linear Regression and Least Squares

Appendix C. Complex Variables Essentials. Difference Equations
1. Basics
2. Power Series and Analytic Functions
3. Difference Equations

Index
Introduction

A time series is a collection of observed values of some characteristic that develops in time. We are surrounded by time series: they come from the economy, the environment, traffic, technology, and so on.
Although time series analysis is a part of statistics, there is a difference between
them. In statistics, we deal with samples, that is with the data collected while we
repeat the same random experiment independently, again and again. Statistical
samples may look similar to a time series. However, in a sample, observations are
independent and identically distributed (and all statistical methods are based on
that assumption). In a time series, observations are correlated with each other and
with time, and their distribution may change in time.
There is one common convention about time series: observations are supposed to be equally spaced in time, such as daily, monthly, or annual data. Naturally, there are exceptions. For instance, certain physical characteristics (such as temperature or pressure) may be registered as a continuous curve. However, such data have to be converted into a discrete series before we can do any numerical analysis.
As another example, stock market data are typically indexed by business days
instead of calendar days. The methods of time series analysis still apply. In some
cases, intervals between consecutive, aperiodic events (such as major accidents,
earthquakes or volcanic eruptions) may be treated as a time series.
Your first encounter with time series analysis could be as follows. Your boss
stops by your desk and says, here is the data, we need a prediction. You navigate
the web, download some software and, in five minutes, get your first result. But it does not look right! At the very least, it is not a good idea to show it to the boss.
You change the options, get another one, which looks wrong as well. You perform
an extensive search on the web, download more software, try to play with options
and parameters. You get a wide range of predictions, some of them look definitely
wrong, some look reasonable (See Figure 1). And then, you realize that the final
decision is going to be yours—it is you who has to choose the one to be sent upstairs.
But how? Which criterion should be used? What do you need to know in order to
make it reasonable?
Examples. The following examples demonstrate how diverse the structure of the data and the objectives of a study can be.
1. Monthly sales of a dry-cleaning company (12 years, 144 data points, Figure
2). We are interested here in a model that gives a good short-term prediction. This
is a series of a relatively simple structure. The sales volume grows in time. (Does it grow linearly? Due to the expansion of the network? Due to the population
growth?). On top of the overall sales growth, we have monthly effects (for instance,
January is, approximately, five percent of annual sales). These monthly effects are
very stable, they practically do not change from year to year. Also, monthly effects

Figure 1. Beginner’s trouble. Data set and predictions, which one to choose?

Figure 2. Monthly sales of a dry cleaning company, 12 years.

are multiplicative, proportional to the achieved level. Trend-based models work very well here (see Chapter 6, Section 1.1).
2. Monthly totals of air passengers (12 years, 144 points, Figure 3). This
famous series looks very similar to the previous one. It looks like we have a trend
(linear? or more likely quadratic?) and monthly effects. However, monthly effects
are not so stable here because some holidays (such as Easter) float around and may
move from one month to another. The best way to proceed would be to include
some extra factors (dates of important holidays) into the model. However, if that
information is not available, we can use a model of different structure here, the
so-called seasonal version of an ARIMA model (see Chapter 6, section 1.3).
3. Sunspots (Annual data, 176 years, Figure 4; monthly data, 65 years, 778
data points, Figure 5). Those data are related to the electromagnetic activity of
the Sun. Monthly data represent the number of sunspots over the month; annual data are the so-called Wolf numbers, annual averages of the monthly ones. Sunspots are not visible to the naked eye; they are actually somewhat colder spots on the surface of the Sun. Sunspots represent an electromagnetic disturbance. The average
number of sunspots is related to the level of electromagnetic activity of the Sun,
and that, in turn, heavily affects our life on Earth (climate, large-scale weather
cycles, health, electromagnetic field, communication and navigation, density of the
air at high altitude, and many other things). The data show a clearly visible eleven-year cycle. But the magnitude of the cycle fluctuates; some other cycles could be revealed. All of us are highly interested in the prediction here. Unfortunately,
Figure 3. Air passengers, 12 years.

Figure 4. Sunspots (annual data).

Figure 5. Sunspots (monthly data). You can see now that the eleven-year cycle begins with a steep rise followed by a not-so-steep decay. No commonly used model describes this type of behavior; this data requires a custom-made non-linear model.

this series is much harder to predict. Unless we develop a model that describes the process from the physical point of view, we can only build an autoregressive model based on correlations between consecutive values of the series. However, if we look at the graph, we see that the shape of the main cycle is not symmetric; the series goes up steeply and then decays at a slower rate. This effect becomes even more transparent if we switch to monthly data (Figure 5). This type of behavior can be described only by a custom-made non-linear model; commonly used models cannot explain it.
4. Monthly averages of an exchange rate (Pound vs US $, sometime back in the eighties) (10 years, 120 data points, Figure 6). All series of this type are hard to predict. In the financial world, it is considered a success if you can predict just the sign of the increment of a series (Will it go up or down? Should we buy or should we sell?) with probability 51.5%.
5. Electromagnetic activity of a human brain (EEG) (1 minute, 7700 data
points, Figure 7). This particular data set was a part of a psychological experiment.
A person is waiting for 20 seconds, then a picture appears and, during the next 20
Figure 6. Exchange Rate.

Figure 7. EEG data (20 seconds was supposed to be a structural change point).

Figure 8. Spectrum of the EEG data. The high peak on the left is a resonance, the peak on the right is the power frequency (50 Hz in Europe).

seconds, the person tries to memorize it, then the picture disappears and he/she
tries to reconstruct it. The idea of the experiment was to relate the data properties
to the type of the brain activity. The project itself was unsuccessful (an attempt to
find a significant difference in the behavior of the series failed). However, spectral
analysis of the data (Figure 8) revealed the presence of two periodic components:
one component corresponds to the frequency of the power generator, the second is
a resonance.
Figure 9. Power Plant data (about 2 weeks). Note clearly visible weekends and daily cycle.

Figure 10. Power plant data revisited. Another problem here: a big chunk of the data (about 5 hours) is missing.

6. Power plant data (demand). (Time step is 1 minute, 48,000 points, Figure 9). We are interested in prediction: you should have some reserve of power, otherwise the electric grid may get overloaded. However, you should not overdo that: a reserve is expensive, for one thing, and it also causes some trouble to the electric grid. On Figure 9, you can clearly see daily and weekly cycles. However, as you can see on Figure 10, some data are missing. Normally, a few missing values here and there could be handled; the so-called Kalman filter is used for that. But here, a five-hour chunk of data is missing (300 data points in a row).
7. Radioactive decay model, 450 points. This is a simulated data (Figure 11)
that has been used in order to test some estimation methods. The model behind the
data is Xt = C1 exp(−λ1 t)+C2 exp(−λ2 t)+C3 exp(−λ3 t)+noise, which corresponds
to a radioactive decay of a mixture of three radioactive substances with different
half–lives characterized by the parameters λ1 , λ2 and λ3 . The trouble here is that
none of the parameters of the model can be estimated consistently. For this data,
we have assumed that there is a large amount of a substance with a short half-life, a moderate amount of a substance with a medium half-life, and an even smaller amount of a substance with a long half-life. So we have used the initial portion of the data
in order to estimate the first component, then tried to use the middle section of the
data in order to estimate the second component (and to refine the estimate for the
first component as well), and, finally, tried to get the third component out of the
Figure 11. Radioactive decay data (logarithmic scale).

Figure 12. Profile data.

Figure 13. Ocean level (annual data). Note how rounding can kill all the information but the long-term trend.

rest of the data. We have only managed to get the first two components, as the
last portion of the data was obscured by the noise.
8. A profile of a surface (Figure 12). Though not a time series, such data can also be analyzed by the methods of time series analysis. The objective of the study was to test a new technology (they wanted to compare characteristics of different profiles).
9. Water level data (the ocean, the Dead Sea, some lakes, Figures 13-17). Annual data, a part of a study of climate changes. More than a hundred major lakes around the globe were used. One of the problems here was that the available data had little overlap in time.
Stationary and Non-stationary series. A series is called stationary if,
broadly speaking, its characteristics and behavior do not change in time. A sta-
tionary series still may contain cycles, that is oscillations that look periodic but
can’t be attributed to the calendar.
Figure 14. Dead Sea level (annual data).

Figure 15. Salt Lake level (annual data).

Figure 16. Lake Ontario (annual data).

A non-stationary series may contain a trend (broadly speaking, non-random, long-term changes in the mean level) or seasonal effects, that is, variations that are annual or otherwise strictly periodic (daily, weekly, monthly, etc.).
In the examples above, the sunspots (Example 3) and the EEG data (Example 5) represent stationary series; the sunspots have a clearly visible 11-year cycle (in fact, other, longer cycles also exist). The dry cleaning sales data, as well as the air passengers data, is a series with a trend and seasonal (monthly) effects. The exchange rate data is a non-stationary series; however, it is not likely that we can call it a series with a trend. The power plant example contains two seasonal components
Figure 17. Lake Huron (annual data).

(day-night effects and business-day versus weekend effects). The series is too short to see the annual cycle that must be there as well (and we should probably expect that the shape of the daily cycle changes from season to season, for instance because of the change in sunrise and sunset times).
Methods: Theoretical and Empirical. There is a wide variety of mod-
els, methods and algorithms used in time series analysis. Each model describes a
particular type of behavior of the data, and the model is not going to work if the
situation is different. It is up to researchers to choose model(s) that might work.
In a sense, models and methods of Time Series Analysis can be split into three categories. The first group contains methods and algorithms that are based on models with assumptions that can be justified (statistically tested). Such models are therefore completely statistically and theoretically legitimate. The second group contains algorithms which, although based on some statistical assumptions, are commonly used in situations that differ, somewhat or significantly, from the original assumptions. Finally, some algorithms are purely empirical: there is no model behind them; they just look reasonable and, in certain situations, produce reasonable results. We will discuss here the basic, commonly used models and algorithms.
Visual Analysis First! Your first step should be to examine the graph of the
series. It may help you to decide what kind of data you’ve got, and maybe to spot
something which should be addressed beforehand. The following data (Figure 18) represent the daily production of cement (in a region?). Due to the nature of the process, the production should not be affected by weekends. However, we can see here a strange pattern in the middle of the series: two points way below (those are Saturdays and Sundays) and one point way up (those are Mondays). We can easily guess that somebody responsible was away for four weeks, and weekend production was reported on Mondays. The outliers are clearly visible on the graph, but are hard to spot otherwise. Moreover, if we do not pay attention to that and do a correlation and/or spectral analysis (Chapters 3 and 5), we may wrongly interpret the results as evidence of a seven-day cycle (Figures 19 and 20), which definitely may appear in such data as daily production (but not in this one).
Preliminary transformations. In earlier days, a preliminary transformation was widely used in order to get a simpler model, or maybe better properties of the model (e.g. a better prediction error).
Figure 18. Daily production data (with outliers).

Figure 19. Autocorrelation of the daily production data. Seventh value (as well as 14th and even 21st) is way off the critical region.

Figure 20. Periodogram (sample spectrum) of the daily production data. The highest peak corresponds to the 7 day period.

The Box-Cox transformation was commonly used for that purpose. It depends on the parameter λ and is defined as follows: $x_t^{(\lambda)} = \dfrac{x_t^{\lambda}-1}{\lambda}$ if λ ≠ 0 and $x_t^{(\lambda)} = \log x_t$ if λ = 0 (so that it is continuous in λ).
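For concreteness, here is a minimal sketch of this transformation in Python (the function name box_cox and the use of NumPy are my own choices, not part of the text):

```python
import numpy as np

def box_cox(x, lam):
    """Box-Cox transform of a positive series x with parameter lam."""
    x = np.asarray(x, dtype=float)
    if lam == 0:
        return np.log(x)          # limiting case, continuous in lambda
    return (x**lam - 1.0) / lam   # general case

# lam = 0 gives the logarithmic transformation discussed below
log_sales = box_cox([2000.0, 4000.0, 8000.0], 0)
```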
Today, we recommend generally avoiding transformations. They should be used only if the result of the transformation admits an interpretation (physical, economical, a percentage of something, a rate of some change, and so on). As a rule, the logarithmic transformation is the only one that fits. It should be used if the data contain an exponential trend, or if the data contain a trend and the magnitude

Figure 21. Sales of the dry cleaning services in logarithmic scale (compare with Figure 2).

Figure 22. Price index (crops), annual data, 1500-1869.

of random or seasonal component(s) looks proportional to the data. This does not
happen often, it is more typical for a series with seasonal effects (or maybe, it is
just more visible in a series with seasonal effects). You can usually see such things
on the graph of a series. Namely, if the graph in the logarithmic scale looks more
homogeneous than the graph in the ordinary scale, then a log transformation may
be considered. However, this is not a common situation. Only two examples from the above fit this description, and those are Examples 1 and 2, the dry cleaning service and the air passengers, see Figures 2 and 3. You can see here that the magnitude of seasonal
oscillation grows with time and is approximately proportional to the current value
of the trend (compare the graph of the series in the plain scale (Figure 2) and in
log scale (Figure 21)).
Another example of this type is the crops price index (annual data, 1500 - 1869
(Yes, 370 years!)). Once again, we can see that the magnitude of the random oscil-
lation grows with the trend. In log scale, the magnitude of the random oscillation
seems to be more or less constant. See Figures 22 and 23.
Figure 23. Price index (crops) in log scale.


CHAPTER 1

Trend-based Models

Trend models are the simplest models that describe non-stationary series. According to the trend model, the observations Xt have the structure

Xt = f (t) + εt

where εt is the noise (say, independent and identically distributed (in short, i.i.d.)
random variables with Eεt = 0) and f (t), the trend, is a ‘nice’ non-random function.
Trend-based methods can be divided into two groups, depending on the length of the data. If the series is short (a few tens of points), we can only approximate f (t) by a parametric curve, for example, fit a linear trend. If the series is long, we typically can't parameterize the trend, and we have to use non-parametric methods.

1. Parametric Models
A typical short series represents some annual data (economical, demographical,
etc.) Those data could be annual averages. It is also possible that the information
is available just once a year, like harvest data. For a short series, we usually fit an
appropriate parametric model, that is, we assume that f (t) is known to us up to a
few unknown parameters, for instance, that it is linear.
Rule of thumb: as everywhere in statistics, we should have at least ten data points per parameter.
The simplest possible model is a linear trend model. We shall discuss it in detail.
We would like to fit a linear trend model to the data X1 , . . . , XN . We assume that

(1.1) Xt = a + bt + εt

where the εt are i.i.d. with zero expectation (so-called white noise). We have three unknown parameters to estimate: the coefficients a, b and the variance of the noise σε². The typical way to do this is to use least squares.
It is more convenient to rewrite the equation (1.1) as

(1.2) Xt = a0 + b(t − t̄) + εt

where t̄ = (N + 1)/2 and a0 = a + bt̄ (roughly speaking, we just use the midpoint of the time interval as the origin). Note that $\sum_{t=1}^{N}(t-\bar t) = 0$. According to the method of least squares, we would like to minimize the sum
$$Q(a_0, b) = \sum_{t=1}^{N}\bigl(X_t - a_0 - b(t-\bar t)\bigr)^2.$$


So, we differentiate with respect to a0 and b and set the partial derivatives to zero. Differentiating with respect to a0, we get the equation
$$0 = \frac{\partial Q}{\partial a_0} = -2\sum_{t=1}^{N}\bigl(X_t - a_0 - b(t-\bar t)\bigr) = -2\sum_{t=1}^{N}X_t + 2N a_0 + 2b\sum_{t=1}^{N}(t-\bar t) = -2\sum_{t=1}^{N}X_t + 2N a_0, \tag{1.3}$$
since $\sum_{t=1}^{N}(t-\bar t) = 0$. So we have
$$\hat a_0 = \bar X = \frac{1}{N}\sum_{t=1}^{N}X_t \tag{1.4}$$

Differentiating with respect to b and substituting X̄ for a0, we get the equation
$$0 = \frac{\partial Q}{\partial b} = -2\sum_{t=1}^{N}(t-\bar t)\bigl(X_t - \bar X - b(t-\bar t)\bigr). \tag{1.5}$$
It can be rewritten as
$$\sum_{t=1}^{N}(t-\bar t)(X_t - \bar X) = b\sum_{t=1}^{N}(t-\bar t)^2,$$
which implies
$$\hat b = \frac{\sum_{t=1}^{N}(X_t - \bar X)(t-\bar t)}{\sum_{t=1}^{N}(t-\bar t)^2} \tag{1.6}$$
or
$$\hat b = \frac{\sum_{t=1}^{N}(t-\bar t)X_t - \bar X\sum_{t=1}^{N}(t-\bar t)}{\sum_{t=1}^{N}(t-\bar t)^2} = \frac{\sum_{t=1}^{N}(t-\bar t)X_t}{\sum_{t=1}^{N}(t-\bar t)^2} \tag{1.7}$$
because $\sum_{t=1}^{N}(t-\bar t) = 0$.
Now, (1.4) and (1.2) imply
$$\hat a_0 = \frac{1}{N}\sum_{t=1}^{N}X_t = \frac{1}{N}\sum_{t=1}^{N}\bigl(a_0 + b(t-\bar t) + \varepsilon_t\bigr) = a_0 + \bar\varepsilon,$$
where $\bar\varepsilon = \frac{1}{N}\sum_{t=1}^{N}\varepsilon_t$ is the average of the εt. In a similar way, from (1.7) and (1.2) we get
$$\hat b = \frac{\sum_{t=1}^{N}(t-\bar t)X_t}{\sum_{t=1}^{N}(t-\bar t)^2} = \frac{\sum_{t=1}^{N}(t-\bar t)\bigl(a_0 + b(t-\bar t) + \varepsilon_t\bigr)}{\sum_{t=1}^{N}(t-\bar t)^2} = \frac{a_0\sum_{t=1}^{N}(t-\bar t) + b\sum_{t=1}^{N}(t-\bar t)^2 + \sum_{t=1}^{N}(t-\bar t)\varepsilon_t}{\sum_{t=1}^{N}(t-\bar t)^2} = b + \frac{\sum_{t=1}^{N}(t-\bar t)\varepsilon_t}{\sum_{t=1}^{N}(t-\bar t)^2}. \tag{1.8}$$

From those equations, it follows (Problem 1) that
$$E\hat a_0 = a_0, \qquad \mathrm{Var}\,\hat a_0 = \frac{1}{N}\sigma_\varepsilon^2,$$
$$E\hat b = b, \qquad \mathrm{Var}\,\hat b = \frac{\sigma_\varepsilon^2}{\sum_{t=1}^{N}(t-\bar t)^2}, \tag{1.9}$$
$$\mathrm{Cov}(\hat a_0, \hat b) = 0.$$
In addition, the estimates have asymptotically normal distribution. The last line
in (1.9) is, in fact, the main advantage of the equation (1.2) over (1.1). Indeed,
if the distribution of the noise εt is Gaussian, a.k.a. normal, then the estimates â0
and b̂ have a bivariate Gaussian distribution. However, since Cov(â0 , b̂) = 0, they
are independent (see A2.5).
It remains to estimate σε². An unbiased estimator for σε² could be found by the formula
$$\hat\sigma_\varepsilon^2 = \frac{Q(\hat a_0, \hat b)}{N-2} \tag{1.10}$$
(see below in the ‘Details’ subsection). Note that Xt − â0 − b̂(t − t̄) represents the residuals in the model (1.2). So, Q(â0, b̂) is actually the minimal possible value of the sum of squares of the residuals. In order to get an unbiased estimate for σε², we have to divide Q(â0, b̂) by the number of observations reduced by the number of estimated parameters, which, in our case, is equal to two.
Testing hypotheses about the values of the parameters. For the linear
trend model, the most important thing to test is the hypothesis H0 : b = 0 (indeed,
if b = 0 then there is no trend at all, the series is stationary).
Suppose that the noise εt is Gaussian. It could be shown (see the sketch in the ‘Details’ subsection) that the random variable U = Q(â0, b̂)/σε² has a χ² distribution with N − 2 degrees of freedom and it is actually independent from â0 and b̂. Also, â0 and b̂ are independent and Gaussian with parameters given by (1.9). For this reason,
$$Z = \frac{\hat b - b}{\sigma_\varepsilon/\sqrt{\sum(t-\bar t)^2}}$$
is standard normal and therefore
$$t = \frac{Z}{\sqrt{U/(N-2)}} \tag{1.11}$$
has a t distribution with N − 2 degrees of freedom. However, the expression in (1.11) can be simplified as follows:
$$\frac{Z}{\sqrt{U/(N-2)}} = \frac{\dfrac{\hat b - b}{\sigma_\varepsilon/\sqrt{\sum(t-\bar t)^2}}}{\sqrt{Q(\hat a_0,\hat b)/(N-2)}\,\big/\,\sigma_\varepsilon} = \frac{\sqrt{\sum(t-\bar t)^2}\,(\hat b - b)}{\hat\sigma_\varepsilon}$$
(here $\hat\sigma_\varepsilon^2$ is given by (1.10)). So, in order to test the hypothesis H0 : b = 0, we should compute the statistic
$$t = \frac{\hat b\,\sqrt{\sum(t-\bar t)^2}}{\hat\sigma_\varepsilon} \tag{1.12}$$
and then compare it with corresponding percentiles for the t-distribution. There-
fore, if we would like to have a test with level of significance α, we should reject H0
if |t| ≥ tα/2,N −2 .
If the noise is not Gaussian, these statements become asymptotic: Z is, ap-
proximately, standard normal and t is, approximately, t-distributed. Hence, the
stationarity test is no longer precise but an approximate one.
Prediction. The model is typically used for short term prediction. The m-step ahead prediction constructed at time N is given by the formula
$$\hat X_{N+m}(N) = \hat a_0 + \hat b\,(N+m-\bar t). \tag{1.13}$$
Hence, the prediction error is equal to
$$X_{N+m} - \hat X_{N+m}(N) = a_0 - \hat a_0 + (b-\hat b)(N+m-\bar t) + \varepsilon_{N+m}. \tag{1.14}$$
Taking into account (1.9), we see that $X_{N+m} - \hat X_{N+m}(N)$ is (approximately) normal with zero expectation and variance
$$\sigma_\varepsilon^2\left(1 + \frac{1}{N} + \frac{(N+m-\bar t)^2}{\sum_{t=1}^{N}(t-\bar t)^2}\right) \tag{1.15}$$
(see Problem 2). Replacing σε² by its estimate from (1.10), we can construct an (approximate) (1 − α)-confidence interval for XN+m (see details below):
$$\hat a_0 + \hat b\,(N+m-\bar t) \;\pm\; t_{\alpha/2,\,N-2}\,\hat\sigma_\varepsilon\sqrt{1 + \frac{1}{N} + \frac{(N+m-\bar t)^2}{\sum(t-\bar t)^2}} \tag{1.16}$$
where $t_{\alpha,k}$ is a percentile of the t-distribution with k degrees of freedom. In particular, if m is comparable to N, then the width of the interval grows, indicating that the forecast is less reliable.
Example. Let the data be given by the following table:
t 1 2 3 4 5 6 7 8
Xt 1.87 11.74 3.78 0.15 3.72 −9.79 12.13 −0.81
t 9 10 11 12 13 14 15 16
Xt 4.72 −19.48 −1.06 −16.96 15.9 1.28 4.85 −6.74
t 17 18 19 20 21 22 23 24
Xt −16.41 3.36 18.38 14.5 5.41 21.2 10.77 11.43
t 25 26 27 28 29 30
Xt −5.56 4.15 25.33 25.16 10.03 11.59
We have N = 30 and t̄ = (N + 1)/2 = 15.5. From (1.4) and (1.6), â0 = 4.75467 and
b̂ = 0.509597, so the estimated trend is given by the formula
4.75467 + 0.509597(t − 15.5) = −3.14409 + 0.509597t
Next, we compute Q(â0, b̂) = 3,234.5, which gives σ̂ε² = 115.518 and σ̂ε = 10.7479. In order to test the hypothesis H0 : b = 0, we use the formula (1.12). We get t = 2.24, and the corresponding quantile of the t distribution is 2.048, so we reject the null hypothesis: the trend is statistically significant. In order to construct a prediction, we just use the formula (1.13). If, for instance, we want a prediction for
prediction, we just use the formula (1.13). If, for instance, we want a prediction for
five steps ahead, we get
X̂31 (30) = 12.6534, X̂32 (30) = 13.163, X̂33 (30) = 13.6726,
X̂34 (30) = 14.1822, X̂35 (30) = 14.6918
Figure 1. Data with fitted trend, −3.14409 + 0.509597t.

Confidence bounds for the prediction could be obtained from (1.16). Namely, the upper bounds are equal to
36.1626, 36.8185, 37.4826, 38.1547, 38.8346
and the lower bounds are
−10.8557, −10.4925, −10.1374, −9.79025, −9.45098.
The data and fitted trend with prediction and confidence bounds are shown on
Figure 1.
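To make the computations in this example easy to reproduce, here is a minimal sketch of the least squares fit, the test statistic (1.12) and the prediction bounds (1.16). It uses NumPy and SciPy; the function name fit_linear_trend is hypothetical, not the author's code.

```python
import numpy as np
from scipy import stats

def fit_linear_trend(x, alpha=0.05, m_ahead=5):
    """Least squares fit of X_t = a0 + b (t - tbar) + noise, t test for b = 0,
    and m-step ahead forecasts with (1 - alpha) confidence bounds."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    t = np.arange(1, N + 1)
    tbar = (N + 1) / 2
    a0_hat = x.mean()                                    # formula (1.4)
    s_tt = np.sum((t - tbar) ** 2)
    b_hat = np.sum((t - tbar) * x) / s_tt                # formula (1.7)
    Q = np.sum((x - a0_hat - b_hat * (t - tbar)) ** 2)   # residual sum of squares
    sigma_hat = np.sqrt(Q / (N - 2))                     # formula (1.10)
    t_stat = b_hat * np.sqrt(s_tt) / sigma_hat           # formula (1.12)
    t_crit = stats.t.ppf(1 - alpha / 2, N - 2)
    reject_H0 = abs(t_stat) >= t_crit
    # forecasts and confidence bounds, formulas (1.13) and (1.16)
    m = np.arange(1, m_ahead + 1)
    forecast = a0_hat + b_hat * (N + m - tbar)
    half_width = t_crit * sigma_hat * np.sqrt(1 + 1 / N + (N + m - tbar) ** 2 / s_tt)
    return a0_hat, b_hat, sigma_hat, t_stat, reject_H0, forecast, half_width
```

Applied to the 30 observations above, it should return â0 ≈ 4.75, b̂ ≈ 0.51 and t ≈ 2.24, together with the forecasts and bounds quoted in the example.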
Least Squares and Maximum Likelihood. In order to be able to speak about likelihood, we need to make assumptions about the distribution of the noise εt. Let us assume once again that the εt are independent normal random variables with zero expectation and variance σε². Then the Xt are also independent and normal with expectation a0 + b(t − t̄) and variance σε². Therefore their joint density could be found from the formula
$$f_{X_1\ldots X_N}(x_1,\ldots,x_N) = \frac{1}{(2\pi)^{N/2}\sigma_\varepsilon^N}\exp\Bigl\{-\frac{1}{2\sigma_\varepsilon^2}\sum_{t=1}^{N}\bigl(x_t - a_0 - b(t-\bar t)\bigr)^2\Bigr\}.$$
Substituting the data X1, . . . , XN instead of x1, . . . , xN, we get the likelihood function
$$L(a_0, b, \sigma_\varepsilon^2) = \frac{1}{(2\pi)^{N/2}\sigma_\varepsilon^N}\exp\Bigl\{-\frac{Q(a_0,b)}{2\sigma_\varepsilon^2}\Bigr\}.$$
Its logarithm therefore equals
$$\log L(a_0, b, \sigma_\varepsilon^2) = -\frac{N}{2}\log(2\pi) - \frac{N}{2}\log(\sigma_\varepsilon^2) - \frac{Q(a_0,b)}{2\sigma_\varepsilon^2}.$$
So, in order to maximize the log likelihood function, we have to minimize Q(a0, b), that is, find the least squares estimates for a0 and b. After that, we have to choose σε² that maximizes
$$-\frac{N}{2}\log(\sigma_\varepsilon^2) - \frac{Q(\hat a_0,\hat b)}{2\sigma_\varepsilon^2}.$$
Setting the derivative to zero, we find the estimate
$$\hat\sigma_\varepsilon^2 = \frac{Q(\hat a_0,\hat b)}{N}.$$
Figure 2. Same data set with an error (X29 = 100). Estimated trend is now −8.52161 + 1.05002t.

So, the maximum likelihood estimates for the coefficients a0 and b coincide with the least squares estimates. The estimate for σε² differs from the one that we discussed earlier (we divide by N instead of N − 2), so it is definitely biased. In a sense, this is typical (maximum likelihood estimates for second moments are usually biased).
Other parametric families. If the data set is long enough, we may use other parametric curves, like polynomials (still linear in parameters, hence still a linear regression). In some applications, we have to use models with asymptotic values. In particular, it could be a logistic curve:
$$X_t = \frac{a}{1 + b e^{-ct}} + \varepsilon_t$$
or a so-called Gompertz curve:
$$\log X_t = a + b r^t + \varepsilon_t$$
where 0 < r < 1. For them, the rate of convergence to the limiting level is exponential. All models of those types are non-linear in parameters, so the estimation is no longer an easy computational problem. However, if the εt are normal, then the least squares estimates for all the parameters except the variance of the noise still coincide with the maximum likelihood estimates.
Least squares and robustness. How sensitive is the method of least squares to outliers or other errors? In order to illustrate that, consider the data set from the example shown on Figure 1. The next two graphs (see Figures 2 and 3) show what happens if we replace the second to last value by 100, and by 500 (this may look harsh, but we are talking here about registration errors, for example, a tenfold error (misplaced decimal point), or a zero instead of the value). For the original data, the estimated trend is equal to −3.14409 + 0.509597t, for the first modification it is equal to −8.52161 + 1.05002t, and for the second modification it is −32.4297 + 3.45269t. On Figure 4, you can compare the original data and the three trend lines.
One of the alternatives is to use the method of least absolute errors (LA), that is, to minimize the sum of absolute values of the errors (instead of squares):
$$\sum_{t=1}^{N}|X_t - a - bt| \to \min.$$
The optimal values of the parameters could be found by the methods of linear
programming. Such estimates are not the best in the Gaussian case, but they
Figure 3. Same data set with an error (X29 = 500). Estimated trend is now −32.4297 + 3.45269t. Confidence intervals for the forecast no longer fit the screen.

Figure 4. The data and three trend lines from the above graphs.

Figure 5. Least squares trend versus least absolute errors trend. The least squares trend is the more slanted of the two.

are not sensitive to data errors. For instance, for the above data and for both
modifications, the estimated trend is 1.60379 + 0.266208t (you can compare it with
the least squares trend on Figure 5). If we make X29 negative, say, −100 or −500,
then it becomes 0.110027 + 0.315999t. In fact, there are just two trend lines that may appear if we change X29. If X29 becomes bigger than some critical value (whatever it is), then it is the less slanted 1.60379 + 0.266208t; otherwise it is equal to 0.110027 + 0.315999t. You can see those trend lines on Figure 6.
Figure 6. Least Absolute Errors trend and data errors. The graph actually contains 5 trend lines (for the original data, for X29 = 100, for X29 = 500, for X29 = −100 and for X29 = −500).
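As an illustration of the linear programming reformulation, here is a minimal sketch of a least absolute errors fit using SciPy's linprog. The auxiliary variables u_t bounding the absolute residuals are a standard trick; the function name fit_lad_trend is my own, not the author's.

```python
import numpy as np
from scipy.optimize import linprog

def fit_lad_trend(x):
    """Least absolute errors fit of X_t = a + b t + noise via linear programming:
    minimize sum(u_t) subject to u_t >= |X_t - a - b t|."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    t = np.arange(1, N + 1)
    # variables: [a, b, u_1, ..., u_N]; the objective is the sum of the u_t
    c = np.concatenate(([0.0, 0.0], np.ones(N)))
    # u_t >= X_t - a - b t      ->  -a - b t - u_t <= -X_t
    # u_t >= -(X_t - a - b t)   ->   a + b t - u_t <=  X_t
    A1 = np.column_stack((-np.ones(N), -t, -np.eye(N)))
    A2 = np.column_stack((np.ones(N), t, -np.eye(N)))
    A_ub = np.vstack((A1, A2))
    b_ub = np.concatenate((-x, x))
    bounds = [(None, None), (None, None)] + [(0, None)] * N
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    a_hat, b_hat = res.x[0], res.x[1]
    return a_hat, b_hat
```

On the data set above, this fit should remain essentially unchanged when X29 is replaced by 100 or 500, which is the robustness property illustrated on Figure 6.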

Another option is to use some penalty function ϕ:
$$\sum_t \varphi(X_t - a - bt) \to \min.$$
However, how should the penalty function be chosen (there exist some suggestions, though)? Also, minimization of this expression could be a challenging computational problem.
Details. 1. Verification of (1.10). We begin with the identity
$$\varepsilon_t = X_t - a_0 - b(t-\bar t) = \bigl(X_t - \hat a_0 - \hat b(t-\bar t)\bigr) + (\hat a_0 - a_0) + (\hat b - b)(t-\bar t). \tag{1.17}$$
Therefore
$$\sum_{t=1}^{N}\varepsilon_t^2 = \sum_{t=1}^{N}\bigl(X_t - \hat a_0 - \hat b(t-\bar t)\bigr)^2 + N(\hat a_0 - a_0)^2 + (\hat b - b)^2\sum_{t=1}^{N}(t-\bar t)^2$$
$$\qquad + 2(\hat a_0 - a_0)\sum_{t=1}^{N}\bigl(X_t - \hat a_0 - \hat b(t-\bar t)\bigr) + 2(\hat a_0 - a_0)(\hat b - b)\sum_{t=1}^{N}(t-\bar t)$$
$$\qquad + 2(\hat b - b)\sum_{t=1}^{N}\bigl(X_t - \hat a_0 - \hat b(t-\bar t)\bigr)(t-\bar t).$$
The last term is zero by (1.5). Also, $\sum_{t=1}^{N}(t-\bar t) = 0$ and therefore the next to last term is also zero. Finally, $\sum_{t=1}^{N}(X_t - \hat a_0) = 0$ by (1.4) and therefore the fourth term is also zero. Hence
$$\sum_{t=1}^{N}\varepsilon_t^2 = Q(\hat a_0, \hat b) + N(\hat a_0 - a_0)^2 + (\hat b - b)^2\sum_{t=1}^{N}(t-\bar t)^2. \tag{1.18}$$
Taking the expectation of both sides in (1.18) and using (1.9), we get
$$E\,Q(\hat a_0, \hat b) = (N-2)\sigma_\varepsilon^2$$
and therefore the estimate (1.10) is unbiased.


2. Distribution of $Q(\hat a_0,\hat b)/\sigma_\varepsilon^2$ (sketch). If the noise is Gaussian, then â0 and b̂ have a joint Gaussian distribution as linear combinations of the noise ε1, . . . , εN. However, they are uncorrelated by (1.9), and therefore independent. Moreover, let
$$e_t = X_t - \hat a_0 - \hat b(t-\bar t) = \varepsilon_t + (a_0 - \hat a_0) + (b - \hat b)(t-\bar t)$$
be the residuals in the model. The residuals are still linear combinations of the values of the noise, and therefore all the et also have a joint Gaussian distribution together with â0 and b̂. However,
$$\mathrm{Cov}(e_t, \hat a_0) = \mathrm{Cov}(\varepsilon_t, \hat a_0) - \mathrm{Cov}(\hat a_0, \hat a_0) - (t-\bar t)\,\mathrm{Cov}(\hat b, \hat a_0) = \frac{1}{N}\sigma_\varepsilon^2 - \frac{1}{N}\sigma_\varepsilon^2 = 0$$
(we used formula (1.4) to compute Cov(εt, â0) and (1.9) for the variance of â0 and the covariance of â0 and b̂). For similar reasons (this time, use (1.9) and (1.6)),
$$\mathrm{Cov}(e_t, \hat b) = \mathrm{Cov}(\varepsilon_t, \hat b) - \mathrm{Cov}(\hat a_0, \hat b) - (t-\bar t)\,\mathrm{Cov}(\hat b, \hat b) = \frac{(t-\bar t)}{\sum_{t=1}^{N}(t-\bar t)^2}\sigma_\varepsilon^2 - \frac{(t-\bar t)}{\sum_{t=1}^{N}(t-\bar t)^2}\sigma_\varepsilon^2 = 0.$$
Therefore each et is independent from â0 and b̂, and so is $Q(\hat a_0,\hat b) = \sum e_t^2$. Next, let us divide each side of (1.18) by σε². We have
$$\sum_{t=1}^{N}\frac{\varepsilon_t^2}{\sigma_\varepsilon^2} = \frac{Q(\hat a_0,\hat b)}{\sigma_\varepsilon^2} + \frac{N(\hat a_0 - a_0)^2}{\sigma_\varepsilon^2} + \frac{(\hat b - b)^2\sum_{t=1}^{N}(t-\bar t)^2}{\sigma_\varepsilon^2}. \tag{1.19}$$
The expression on the left is a sum of N squares of independent standard normal random variables, so it has a χ² distribution with N degrees of freedom. Next, the second and third terms on the right also represent squares of independent standard normals (to see that, we have to recall the variances of â0 and b̂ from (1.9)). So, each of them has a χ² distribution with one degree of freedom. Finally, since Q(â0, b̂) is independent from â0 and b̂, the moment generating function (or the characteristic function, if we prefer) of the left side of (1.19) is equal to the product of the moment generating functions of each of the three terms on the right, and therefore the moment generating function of Q(â0, b̂)/σε² could be found as a quotient. And it turns out to be the moment generating function of the χ² distribution with N − 2 degrees of freedom.
3. Verification of (1.16) (sketch). Once again, assume that the noise is Gaussian. According to the previous, $U = Q(\hat a_0,\hat b)/\sigma_\varepsilon^2$ has a χ² distribution with N − 2 degrees of freedom, and it is independent from â0 and b̂. It is also independent from εN+m whenever m > 0. Therefore U is independent from $X_{N+m} - \hat X_{N+m}(N)$ because of (1.14). Next, the random variable $X_{N+m} - \hat X_{N+m}(N)$ is normal with expectation zero and variance given by the formula (1.15). Therefore
$$t = \frac{\dfrac{X_{N+m} - \hat X_{N+m}(N)}{\sigma_\varepsilon\sqrt{1 + \frac{1}{N} + \frac{(N+m-\bar t)^2}{\sum(t-\bar t)^2}}}}{\sqrt{\dfrac{Q(\hat a_0,\hat b)}{\sigma_\varepsilon^2 (N-2)}}} \tag{1.20}$$
has a t distribution with N − 2 degrees of freedom. However, σε² cancels out, and $Q(\hat a_0,\hat b)/(N-2) = \hat\sigma_\varepsilon^2$. For those reasons, the expression in (1.20) actually boils down to
$$t = \frac{X_{N+m} - \hat X_{N+m}(N)}{\hat\sigma_\varepsilon\sqrt{1 + \frac{1}{N} + \frac{(N+m-\bar t)^2}{\sum(t-\bar t)^2}}}. \tag{1.21}$$
Now, |t| does not exceed $t_{\alpha/2,\,N-2}$ with probability 1 − α. Solving for XN+m and recalling that X̂N+m(N) = â0 + b̂(N + m − t̄), we get (1.16). If the noise is not Gaussian, the statement becomes an asymptotic one.

Exercises
1. Verify (1.9).
2. Using (1.9), show that the variance of the prediction error (1.14) is indeed
given by (1.15).
3. For a data set (will be posted on the web), fit a linear trend, test the
hypothesis H0 : b = 0 at the 5% level of significance and construct a forecast with
95% confidence intervals for the next five values of the series.

2. Moving Averages
Suppose now that the series is long (and non-stationary). As a rule, there is no parametric curve that approximates the whole data set. There exist several methods of forecasting non-stationary series, which will be discussed later. However, sometimes we need just to decompose the series into a sum of a trend and a random
component. One of the ways to proceed is called the method of Moving Averages.
It is based on a local approximation by a polynomial trend. In order to estimate
the trend f (t) at time t, we fit a polynomial trend to the data on the time interval
[t − l, t + l]. The value of the fitted trend at time t is the estimate for f (t). The
degree of the polynomial and the window width l are the parameters of the method.
It looks like we have to re-do the computations completely as we move from t
to t + 1 but the reality is not so bad. To figure out what is going on, as well as why
it is a reasonable thing to do, let us suppose that the data can be represented as
Xt = f (t) + εt
where εt (noise) is i.i.d. and f (t) (trend) is “smooth”, that is, it can be locally
approximated by a polynomial.
To begin with, suppose that f (t) is locally linear—any portion of the data of
given length 2l + 1 can be treated as a series with linear trend at + b, however, the
coefficients a and b slowly change in time.
In order to compute the coefficients of the trend on the interval [t − l, t + l], we have to minimize the sum
$$\sum_{s=t-l}^{t+l}(X_s - a - bs)^2$$
or, which is the same,
$$\sum_{i=-l}^{l}\bigl(X_{t+i} - a - b(t+i)\bigr)^2 = \sum_{i=-l}^{l}\bigl(X_{t+i} - (a+bt) - bi\bigr)^2 = \sum_{i=-l}^{l}(X_{t+i} - \tilde a - bi)^2$$
where ã = a + bt. Note that ã is exactly the value of the trend at t. Differentiating with respect to ã, we get the equation
$$\sum_{i=-l}^{l}(X_{t+i} - \tilde a - bi) = 0$$
which implies
$$\sum_{i=-l}^{l}X_{t+i} = (2l+1)\tilde a + b\sum_{i=-l}^{l}i = (2l+1)\tilde a.$$
Therefore the estimate for ã, and therefore the estimate f̂(t) for the value of the trend at time t, could be found from the formula
$$\tilde a = \hat f(t) = \frac{1}{2l+1}\sum_{i=-l}^{l}X_{t+i} \tag{2.1}$$

So, fˆ(t) is actually just the average of the values of the series over the time interval
[t − l, t + l] (the moving average, or the moving average of the 1st order, since we
are using an approximation by a linear function). Note that we don’t actually need
the value of the coefficient b.
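A minimal sketch of the moving average (2.1) as a NumPy convolution (the function name is my own choice):

```python
import numpy as np

def moving_average(x, l):
    """First order moving average (2.1): average of x over [t - l, t + l].
    Returns estimates only for t = l + 1, ..., N - l (end effects are dropped)."""
    x = np.asarray(x, dtype=float)
    window = np.ones(2 * l + 1) / (2 * l + 1)
    return np.convolve(x, window, mode="valid")
```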
If we increase the degree of the polynomial, we arrive at weighted moving averages. For instance, let us assume that the function f (t) can be locally approximated by a parabola. As above, we have to minimize the sum
$$\sum_{s=t-l}^{t+l}(X_s - a - bs - cs^2)^2$$

or, which is the same, the sum
$$\sum_{i=-l}^{l}\bigl(X_{t+i} - a - b(t+i) - c(t+i)^2\bigr)^2 = \sum_{i=-l}^{l}\bigl(X_{t+i} - (a+bt+ct^2) - (b+2ct)i - ci^2\bigr)^2 = \sum_{i=-l}^{l}(X_{t+i} - \tilde a - \tilde b i - ci^2)^2$$
where ã = a + bt + ct² and b̃ = b + 2ct. Once again, ã is the value of the trend at time t, and therefore an estimate for ã is an estimate for f (t).
Differentiating with respect to ã and c, we get two equations:
$$\sum_{i=-l}^{l}X_{t+i} = \tilde a(2l+1) + \tilde b\sum_{i=-l}^{l}i + c\sum_{i=-l}^{l}i^2 = \tilde a(2l+1) + c\sum_{i=-l}^{l}i^2,$$
$$\sum_{i=-l}^{l}i^2 X_{t+i} = \tilde a\sum_{i=-l}^{l}i^2 + \tilde b\sum_{i=-l}^{l}i^3 + c\sum_{i=-l}^{l}i^4 = \tilde a\sum_{i=-l}^{l}i^2 + c\sum_{i=-l}^{l}i^4,$$
since $\sum_{i=-l}^{l}i = \sum_{i=-l}^{l}i^3 = 0$. Let us denote
$$I_2 = \sum_{i=-l}^{l}i^2 = \frac{l(l+1)(2l+1)}{3}, \qquad I_4 = \sum_{i=-l}^{l}i^4 = \frac{l(l+1)(2l+1)(3l^2+3l-1)}{15},$$
$$S_t = \sum_{i=-l}^{l}X_{t+i}, \qquad Z_t = \sum_{i=-l}^{l}i^2 X_{t+i}.$$
Solving for c, we get
$$c = \frac{Z_t}{I_4} - \tilde a\,\frac{I_2}{I_4}$$
which implies
$$\tilde a = \hat f(t) = \frac{S_t - I_2 Z_t/I_4}{2l+1 - (I_2)^2/I_4} = \frac{1}{2l+1 - (I_2)^2/I_4}\sum_{i=-l}^{l}\Bigl(1 - i^2\frac{I_2}{I_4}\Bigr)X_{t+i}.$$
Therefore the estimate f̂(t) is equal to the weighted moving average, or moving average of the 2nd order,
$$\hat f(t) = \sum_{i=-l}^{l}a_i X_{t+i} \tag{2.2}$$
with weights
$$a_i = \frac{1 - i^2 I_2/I_4}{2l+1 - (I_2)^2/I_4}, \qquad i = -l, \ldots, l. \tag{2.3}$$
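Similarly, here is a minimal sketch of the second order weighted moving average with the weights (2.3) (again NumPy; the helper name is my own):

```python
import numpy as np

def weighted_moving_average(x, l):
    """Second order (locally quadratic) moving average with weights (2.3)."""
    x = np.asarray(x, dtype=float)
    i = np.arange(-l, l + 1)
    I2 = np.sum(i**2)              # = l(l+1)(2l+1)/3
    I4 = np.sum(i**4)              # = l(l+1)(2l+1)(3l^2+3l-1)/15
    a = (1 - i**2 * I2 / I4) / (2 * l + 1 - I2**2 / I4)   # weights (2.3)
    # np.convolve flips the kernel, but the weights are symmetric in i
    return np.convolve(x, a, mode="valid")
```

The weights are symmetric in i and sum to one, which is why a quadratic trend passes through this filter unchanged (see the next subsection and Problem 4).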
Yet another interpretation. Let again Xt = f (t) + εt where εt is i.i.d. and f (t) is locally linear, and let
$$Y_t = \frac{1}{2l+1}\sum_{k=-l}^{l}X_{t+k}$$
be the moving average. Then
$$Y_t = \frac{1}{2l+1}\sum_{k=-l}^{l}f(t+k) + \frac{1}{2l+1}\sum_{k=-l}^{l}\varepsilon_{t+k} = \bar f(t) + \bar\varepsilon_t$$
where $\bar f(t)$ is the average of f (t) over the interval [t − l, t + l] and $\bar\varepsilon_t$ is the similar average of the noise. If the function f is (nearly) linear, then $\bar f(t) \approx f(t)$, and
$$\mathrm{Var}(\bar\varepsilon_t) = \frac{1}{2l+1}\mathrm{Var}(\varepsilon_t)$$
has been significantly reduced.
Suppose now that f (t) is locally quadratic and we are using the corresponding weighted moving averages. Once again,
$$Y_t = \sum_{k=-l}^{l}a_k f(t+k) + \sum_{k=-l}^{l}a_k\varepsilon_{t+k} = \bar f(t) + \bar\varepsilon_t$$
where $\bar f(t)$ and $\bar\varepsilon_t$ are the weighted moving averages. However, the weights are designed in such a way that $\bar f(t) \approx f(t)$ (if f is precisely quadratic, then $\bar f = f$, see Problem 4), and the variance of the new noise is equal to
$$\mathrm{Var}(\bar\varepsilon_t) = \sum_i a_i^2\,\mathrm{Var}(\varepsilon_t),$$
which is also substantially smaller than Var(εt). However, though the original noise was i.i.d., the new noise is still identically distributed but no longer independent.
Figure 7. Logarithm of the crops price index data.

Figure 8. Moving Average of the first order, l = 10.

Comments. 1. Role of the Window Width. How does the result depend on the window width? The bigger l is, the less weight is given to any specific observation, and the estimated trend becomes smoother. On the other hand, if we increase the degree of the polynomial, the estimated trend becomes less smooth, more sensitive to small scale fluctuations. For the following data set (logarithm of the crops price data, about 350 points, see Figure 7), compare the first order moving average with l = 10 (Figure 8), the first order moving average with l = 25 (Figure 9) and the second order moving average with l = 25 (Figure 10). Note how the relatively big values around the year 1630 are noticed by the first and the third graphs, and practically ignored by the second one.
2. Rules of thumb. First of all, polynomials of degrees higher than three do not pay off. Next, the width of the window (2l + 1) should be at least 10 times bigger than the number of parameters used (say, l should be at least 10 for a linear trend, at least 20 for quadratic and cubic). Also, the polynomial approximation should look plausible within the window of that size (we have to examine the graph for that). For instance, for the data shown on Figure 11, any approximation around the point t = 29 does not look possible. Other than that, a linear approximation is not likely and parabolic or cubic is probably okay for l = 20.
3. Drawbacks. The method requires l points before and after the point t. Hence, for a data set X1, . . . , XN, the estimate f̂(t) can only be computed for time instances between l + 1 and N − l (so-called end effects). In order to handle time instances that are close to the beginning and the end of the data set, one could fit
Figure 9. Moving Average of the first order, l = 25.

Figure 10. Moving Average of the 2nd order, l = 25.

Figure 11. Coal production data. The series changes its behavior around the point t = 29; local approximation by a polynomial is not likely around that point.

a trend line to the available section of the data (say, in order to estimate f (2), we
could fit a trend to the points X1 , . . . , X2+l ; however, this is no longer the weighted
average with the same weight coefficients). Also, the method can hardly be used for prediction (even if we fit a trend to the last portion of the data, we are using only the last l + 1 data points and we are ignoring the rest of the data).
Robustness of the method and 53X smoothing. As with everything based on least squares, the method is extremely sensitive to data errors (outliers). To illustrate that, we have replaced one of the values in the above example (Figure 7)
Figure 12. Data with outlier: Moving Average of the first order, l = 10.

Figure 13. Data with outlier: Moving Average of the 3rd order, l = 25.

by 15. Results are shown on Figures 12 and 13 (compare them with Figures 8 and
10, respectively).
An alternative is to combine the method with other procedures that eliminate or reduce outliers. One of the possibilities is the so-called 53X smoothing (suggested by Tukey). It is a heuristic procedure with no probabilistic interpretation or motivation. However, it practically eliminates irregularities, such as outliers. It is a three-stage procedure.
First step: We define Yt as a median of 5 values Xt−2 , Xt−1 , Xt , Xt+1 , Xt+2
(that is, we re-arrange them in an increasing order, and take the middle point).
Second step: Zt is a median of Yt−1 , Yt , Yt+1 .
Third step: Finally, Wt = .25Zt−1 + .5Zt + .25Zt+1 .
The following example shows how it works:
Xt 55 54 56 52 76 113 68 59 74 64
Yt ... ... 55 56 68 68 74 68 ... ...
Zt ... ... ... 56 68 68 68 ... ... ...
Wt ... ... ... ... 65 68 . . . ... ... ...
End effects are not significant here (we are losing only four points at each end;
there exists a modification of the algorithm that allows us to handle those as well).
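A minimal sketch of the 53X procedure as described above (plain NumPy; end points are simply left as missing values here, without the modification mentioned in the text):

```python
import numpy as np

def smooth_53x(x):
    """Tukey's 53X smoothing: running median of 5, then median of 3,
    then the weighted average .25, .5, .25. End points are left as NaN."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    y = np.full(n, np.nan)
    z = np.full(n, np.nan)
    w = np.full(n, np.nan)
    for t in range(2, n - 2):
        y[t] = np.median(x[t - 2:t + 3])      # step 1: median of 5
    for t in range(3, n - 3):
        z[t] = np.median(y[t - 1:t + 2])      # step 2: median of 3
    for t in range(4, n - 4):
        w[t] = 0.25 * z[t - 1] + 0.5 * z[t] + 0.25 * z[t + 1]  # step 3
    return w

# reproduces the small worked example above: Wt = 65 and 68 where defined
print(smooth_53x([55, 54, 56, 52, 76, 113, 68, 59, 74, 64]))
```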
The 53X procedure does not significantly change the data just by itself (see
Figure 14). However, if we apply 53X smoothing before the (weighted) moving
Figure 14. Original series (log of the price index) and the 53X smoothing.

Figure 15. Data with outlier: a combination of the 53X smoothing and 1st order moving averages with l = 10. Compare with Figure 12.

averages, it practically eliminates the effect of outliers (see Figure 15; compare it
with Figure 12).

Exercises
1. Find the formula for the weights for a cubic polynomial and compare them
with the weights given by the formula (2.3).
2. Find the formula for the weights for a 4th degree polynomial. Give the answer in terms of I2, I4,
$$I_6 = \sum_{i=-l}^{l}i^6 = \frac{l(l+1)(2l+1)(3l^4+6l^3-3l+1)}{21}$$
and
$$I_8 = \sum_{i=-l}^{l}i^8 = \frac{l(l+1)(2l+1)(5l^6+15l^5+5l^4-15l^3-l^2+9l-3)}{45}.$$

3. Show that, if Xt = a0 + a1 t is a linear function, then the moving average (2.1) does not change Xt.
4. Show that, if Xt = a0 + a1 t + a2 t2 + a3 t3 is a cubic polynomial, then second
order weighted moving average (described by the formula (2.3)) does not change
Xt .

5.(a) Apply moving average (2.1) with l = 15 to a data set (will be posted
on the web) and graph the results. (b) For the same data, first, apply the 53X
procedure, and then apply the moving average to the result. Graph the results and
compare them with the previous ones.
6.(a) Apply the weighted moving average (2.2), (2.3) with l = 25 to a data
set (will be posted on the web) and graph the results. (b) For the same data,
first, apply the 53X procedure, and then apply the weighted moving average to the
result. Graph the results and compare them with the previous ones.

3. Exponential Smoothing
Exponential smoothing is a method of prediction based on a local approxi-
mation of a series by a polynomial (or, sometimes, by another parametric curve).
Though it is called “smoothing”, it has nothing to do with smoothing at all. In the
simplest form, it could be described as follows.
Suppose we have a long series and the observations have a structure Xt =
a(t) + εt where εt is the white noise, that is i.i.d. random variables with zero
expectation, and a(t) very slowly changes in time. So, the most reasonable short
term prediction at time n should be a constant, namely the last value of the trend
a(n). In order to estimate a(n), we could take an average of the recent observations,
but then we have to decide how many of them to include. Also, when another
observation becomes available, we have to do all the computations again.
Instead of that, we fit a model Xt = a + εt to the whole series, giving more
weight to recent observations. Namely, we minimize the sum

$$\sum_{k=0}^{\infty}\beta^k (X_{n-k} - a)^2 \tag{3.1}$$
where 0 < β < 1 is a discount coefficient (to outline the idea and to simplify the formulae, we assume that the observations begin at −∞). Differentiating with respect to a, we get
$$\sum_{k=0}^{\infty}\beta^k (X_{n-k} - a) = 0$$
and therefore
$$\hat a(n) = \frac{\sum_{k=0}^{\infty}\beta^k X_{n-k}}{\sum_{k=0}^{\infty}\beta^k} = \sum_{k=0}^{\infty}(1-\beta)\beta^k X_{n-k}, \tag{3.2}$$
so it is a weighted average of the past values, with exponentially decaying weights. Note that the sum of the weights is
$$\sum_{k=0}^{\infty}(1-\beta)\beta^k = 1.$$
Hence Xn, which is the last available observation, receives a relative weight (1 − β).
Denote by X̂n+1(n) a one step ahead forecast made at time n. In our case, it is equal to â(n) and therefore it is given by the formula (3.2). We therefore have
$$\hat X_{n+1}(n) = \sum_{k=0}^{\infty}(1-\beta)\beta^k X_{n-k} = (1-\beta)X_n + \beta\sum_{k=0}^{\infty}(1-\beta)\beta^k X_{n-1-k} = (1-\beta)X_n + \beta\hat X_n(n-1) = \hat X_n(n-1) + (1-\beta)\bigl(X_n - \hat X_n(n-1)\bigr). \tag{3.3}$$
Now, we denote α = 1 − β and en = Xn − X̂n (n − 1) (so en is the one step ahead


prediction error, the difference between the actual value Xn and the prediction
X̂n (n − 1) made at time n − 1). Also, we replace X̂n+1 (n) by â(n) and X̂n (n − 1)
by â(n − 1). Then (3.3) becomes
â(n) = â(n − 1) + αen .
Changing from n to n + 1, we get an equation
(3.4) â(n + 1) = â(n) + αen+1
The equation (3.4) describes the so called zero order exponential smoothing. Here
α = 1 − β is the parameter of the algorithm. It is called the smoothing constant.
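The recursion (3.4) is a one line update; here is a minimal sketch in Python, seeded with the first observation as in the 'Practical implementation' subsection below (the function name is my own):

```python
def exp_smoothing_0(x, alpha):
    """Zero order exponential smoothing (3.4): a(n+1) = a(n) + alpha * e(n+1).
    Returns the one step ahead predictions and the final level a."""
    a = x[0]                 # seed with the first observation
    predictions = [None]     # no prediction available for the first point
    for value in x[1:]:
        predictions.append(a)    # prediction for this point is the current level
        e = value - a            # one step ahead prediction error
        a = a + alpha * e        # update (3.4)
    return predictions, a

# with X = 3, 7, 6, 5, 4 and alpha = 0.1 the final level is 3.8146,
# matching the zero order table in the numerical example below
preds, level = exp_smoothing_0([3, 7, 6, 5, 4], 0.1)
```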
First and Second Order Exponential Smoothing. Instead of approximating a series by a constant, we could approximate it by a polynomial. This way, we get exponential smoothing of orders one and two (in practice, one never uses anything beyond that). Let us discuss the first order algorithm. In order to construct a prediction at time n, we (locally) approximate the data by a linear function
$$X_{n+k} \approx a(n) + b(n)k + \varepsilon_{n+k}.$$
In order to estimate a and b, we minimize
$$Q(a, b) = \sum_{k=0}^{\infty}\beta^k (X_{n-k} - a + bk)^2.$$
Differentiating with respect to a and b, we get the equations
$$\sum_{k=0}^{\infty}\beta^k (X_{n-k} - a + bk) = 0, \qquad \sum_{k=0}^{\infty}k\beta^k (X_{n-k} - a + bk) = 0. \tag{3.5}$$
Denote by â(n) and b̂(n) the solution to (3.5). Let X̂n+1 (n) = â(n) + b̂(n) be the
one step ahead prediction and let en+1 be the one step ahead prediction error
en+1 = Xn+1 − â(n) − b̂(n).
It could be shown (see the sketch below) that the estimates â(n + 1) and b̂(n + 1)
could be obtained from the previous estimates â(n) and b̂(n) and from the prediction
error en+1 via the equations
$$\hat a(n+1) = \hat a(n) + \hat b(n) + (2\alpha - \alpha^2)e_{n+1}, \qquad \hat b(n+1) = \hat b(n) + \alpha^2 e_{n+1}, \tag{3.6}$$

where α = 1 − β is the smoothing parameter. The equations (3.6) describe the first order exponential smoothing. Note that if en+1 is zero, then the coefficient
b(n) does not change, and the adjustment of a(n) is due to the change of the origin
(when we move from n to n + 1, Xn+k becomes Xn+1+(k−1) ).
In a similar way, we could handle the second degree polynomials. In that case,
we approximate Xn+k by a quadratic function
Xn+k ≈ a(n) + b(n)k + c(n)k 2 + εn+k .
We estimate a(n), b(n) and c(n) by minimizing
$$Q(a, b, c) = \sum_{k=0}^{\infty}\beta^k (X_{n-k} - a + bk - ck^2)^2. \tag{3.7}$$
Once again, let X̂n+1(n) = â(n) + b̂(n) + ĉ(n) be the one step ahead prediction and let en+1 be the one step ahead prediction error
$$e_{n+1} = X_{n+1} - \hat a(n) - \hat b(n) - \hat c(n).$$
Then the estimates â(n + 1), b̂(n + 1) and ĉ(n + 1) are related to the previous estimates â(n), b̂(n) and ĉ(n) and the prediction error en+1 by the equations
$$\hat a(n+1) = \hat a(n) + \hat b(n) + \hat c(n) + \alpha(3 - 3\alpha + \alpha^2)e_{n+1},$$
$$\hat b(n+1) = \hat b(n) + 2\hat c(n) + 3\alpha^2\Bigl(1 - \frac{\alpha}{2}\Bigr)e_{n+1}, \tag{3.8}$$
$$\hat c(n+1) = \hat c(n) + \frac{\alpha^3}{2}e_{n+1}.$$
The equations (3.8) describe the second order exponential smoothing. Once again,
if the prediction error en+1 is zero, then the coefficient c(n) does not change, and
the adjustment of a(n) and b(n) is due to the change of the origin.
Confidence bounds? No. Can we come up with some confidence bounds
for the prediction, as we did when we discussed the linear trend model? The
answer is ‘No’. The confidence bounds constructed in Section 1.1 were based on several assumptions, among them the independence of the noise, the model itself (that is, the equation (1.1)) and more. In particular, we could claim that the prediction is unbiased, etc. None of those is applicable here because we don't assume anything about the structure of the data.
However, we can still suggest something. What we can do is to check how good the prediction was in the past, and estimate the variance of the one step ahead prediction (and, more generally, the variance of the k steps ahead prediction). Namely, for the one step ahead prediction, we can compute the average of the squares of the prediction errors,
$$\hat\sigma_1^2 = \frac{1}{n-1}\sum_{k=2}^{n}\bigl(X_k - \hat X_k(k-1)\bigr)^2.$$

Then, we can do the same for the two steps ahead prediction,
$$\hat\sigma_2^2 = \frac{1}{n-2}\sum_{k=3}^{n}\bigl(X_k - \hat X_k(k-2)\bigr)^2,$$

and so on. After that, we can compute the standard deviations and get the bounds X̂n+k(n) ± 2σ̂k. The bounds are purely empirical; there is no probability associated with them, they only show how good the prediction was in the past. Nonetheless, they might give us an idea about how the method is working for the given data. In particular, if the prediction deviates from the actual data by more than two standard deviations several times in a row, it might mean that the generating mechanism has changed and we should forget about the old data and start anew.
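A minimal sketch of these empirical bounds, assuming we already have the list of k step ahead predictions produced by one of the smoothing algorithms (the helper name is my own):

```python
import numpy as np

def empirical_sigma(x, predictions_k, k):
    """Empirical standard deviation of the k step ahead prediction error.
    predictions_k[i] is the prediction of x[i] made k steps earlier (or None)."""
    errors = [xi - pi for xi, pi in zip(x[k:], predictions_k[k:]) if pi is not None]
    return float(np.sqrt(np.mean(np.square(errors))))

# empirical bounds for a future forecast x_hat: x_hat - 2*sigma_k, x_hat + 2*sigma_k
```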
Practical implementation. A time series never starts at −∞. So, we use
the first observation to set up initial value(s) for the parameter(s). For instance,
in case of zero order exponential smoothing, we set â(1) = X1 . If we use the first
order exponential smoothing, then we set â(1) = X1 and b̂(1) = 0. In case of second
order exponential smoothing, we set â(1) = X1 and b̂(1) = ĉ(1) = 0. After that,
we use the corresponding renewal equations (3.4) or (3.6) or (3.8). The following
numerical example shows how it works.
Let the first five values of the series be X1 = 3, X2 = 7, X3 = 6, X4 = 5, X5 = 4,
and let α = 0.1. For the zero order exponential smoothing, we set â(1) = X1 = 3.
Then our prediction for X2 equals 3, and prediction error equals e2 = X2 −â(1) = 4.
Therefore â(2) = â(1) + αe2 = 3.4, and that is our prediction for X3 . Hence, the
prediction error e3 = X3 − â(2) = 2.6 and â(3) = â(2) + αe3 = 3.66. So, our
prediction for X4 is 3.66, and the prediction error e4 = X4 − â(3) = 1.34. Next,
â(4) = â(3) + αe4 = 3.794, and that is our prediction for X5 . So, the prediction
error e5 = X5 − â(4) = 0.206. Finally, â(5) = â(4) + αe5 = 3.8146, and that is our
prediction for X6 (and for all subsequent time instances as well).
Time   Xi    Prediction   Prediction error   â(i)
1      3     ...          ...                3
2      7     3            4                  3.4
3      6     3.4          2.6                3.66
4      5     3.66         1.34               3.794
5      4     3.794        0.206              3.8146
6      ...   3.8146       ...                ...

Zero Order Exponential Smoothing

For first order exponential smoothing, we set â(1) = X1 = 3 and b̂(1) = 0. So,
our prediction for X2 still equals 3 and the prediction error e2 still equals 4. For
that reason, â(2) = â(1) + b̂(1) + (2α − α2 )e2 = 3.76 and b̂(2) = b̂(1) + α2 e2 = 0.04.
Therefore our prediction for X3 equals â(2) + b̂(2) = 3.8 and the prediction error e3
turns out to be equal to 2.2. Once again, â(3) = â(2) + b̂(2) + (2α − α2 )e3 = 4.218
and b̂(3) = b̂(2) + α2 e3 = 0.062. So, our prediction for X4 equals â(3) + b̂(3) = 4.28
and prediction error e4 turns out to be 0.72. Hence â(4) = â(3) + b̂(3) + (2α −
α2 )e4 = 4.4168 and b̂(4) = b̂(3) + α2 e4 = 0.0692, our prediction for X5 equals
â(4) + b̂(4) = 4.486 and the prediction error e5 = −0.486. Using the iteration
formulas (3.6) for the last time, we find â(5) = â(4) + b̂(4) + (2α − α²)e5 = 4.39366 and b̂(5) = b̂(4) + α²e5 = 0.06434. Our prediction for X6 equals â(5) + b̂(5) = 4.458. If, at this time, we need an i steps ahead prediction, that is, a prediction for X5+i, then it equals â(5) + b̂(5)i. So, the prediction for X7 equals 4.52234, the prediction for X8 equals 4.58668, and so on.

Time   Xi    Prediction   Prediction error   â(i)      b̂(i)
1      3     ...          ...                3         0
2      7     3            4                  3.76      0.04
3      6     3.8          2.2                4.218     0.062
4      5     4.28         0.72               4.4168    0.0692
5      4     4.486        −0.486             4.39366   0.06434
6      ...   4.458        ...                ...       ...

First Order Exponential Smoothing

Finally, for second order exponential smoothing, we set â(1) = X1 = 3 and


b̂(1) = ĉ(1) = 0. Once again, our prediction for X2 equals 3 and the prediction
error e2 equals 4. Now we have to use the iteration formulas (3.8). According to
them, â(2) = â(1) + b̂(1) + ĉ(1) + α(3 − 3α + α²)e2 = 4.084, b̂(2) = b̂(1) + 2ĉ(1) + 3α²(1 − α/2)e2 = 0.114 and ĉ(2) = ĉ(1) + (α³/2)e2 = 0.002. Therefore the prediction for
X3 equals â(2) + b̂(2) + ĉ(2) = 4.2 and the prediction error e3 equals 1.8. Again,
we use (3.8) and get â(3) = 4.6878, b̂(3) = 0.1693, ĉ(3) = 0.0029. Therefore our
prediction for X4 equals â(3)+ b̂(3)+ĉ(3) = 4.86 and next prediction error e4 equals
0.14. Using (3.8) one more time, we get â(4) ≈ 4.898, b̂(4) ≈ 0.179, ĉ(4) = 0.00297.
Hence, the prediction for X5 equals â(4) + b̂(4) + ĉ(4) = 5.08 and the prediction
error e5 = −1.08. Finally, we get â(5) ≈ 4.787, b̂(5) ≈ 0.154, ĉ(5) ≈ 0.00243 and the prediction for X6 equals â(5) + b̂(5) + ĉ(5) = 4.944. The prediction for X5+i, that is, the i steps ahead prediction, is given by the formula â(5) + b̂(5)i + ĉ(5)i². In particular, the prediction for X7 equals 5.10554, the prediction for X8 equals 5.27194 and so on.

Time Xi Prediction Prediction error â(i) b̂(i) ĉ(i)


1 3 ... ... 3 0 0
2 7 3 4 4.084 0.114 0.002
3 6 4.2 1.8 4.6878 0.1693 0.0029
4 5 4.86 0.14 4.898 0.179 0.00297
5 4 5.08 −1.08 4.787 0.154 0.00243
6 ... 4.944 ... ... ... ...
Second Order Exponential Smoothing
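The two remaining tables can be verified the same way. Below is a Python sketch of the first and second order recursions with the update coefficients used in the worked example; again, the code is only an illustration, not part of the method's derivation.

```python
# First and second order exponential smoothing for the same five observations.
X = [3, 7, 6, 5, 4]
alpha = 0.1

# First order: prediction = a + b, then
#   a <- a + b + (2*alpha - alpha**2)*e,   b <- b + alpha**2 * e
a, b = X[0], 0.0
for n in range(1, len(X)):
    e = X[n] - (a + b)
    a, b = a + b + (2*alpha - alpha**2) * e, b + alpha**2 * e
print("order 1:", round(a, 5), round(b, 5))            # 4.39366, 0.06434
print("forecasts X6, X7:", round(a + b, 5), round(a + 2*b, 5))

# Second order: prediction = a + b + c, then
#   a <- a + b + c + alpha*(3 - 3*alpha + alpha**2)*e
#   b <- b + 2*c + 3*alpha**2*(1 - alpha/2)*e
#   c <- c + (alpha**3 / 2)*e
a, b, c = X[0], 0.0, 0.0
for n in range(1, len(X)):
    e = X[n] - (a + b + c)
    a, b, c = (a + b + c + alpha*(3 - 3*alpha + alpha**2) * e,
               b + 2*c + 3*alpha**2*(1 - alpha/2) * e,
               c + (alpha**3 / 2) * e)
print("order 2:", round(a, 5), round(b, 5), round(c, 5))  # ~4.78732, 0.15425, 0.00243
print("forecast X6:", round(a + b + c, 5))                # ~4.944
```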

Remark. Note that the adjustment of coefficients b and c goes at a much


slower rate than that of the coefficient a. For that reason, wrong initial values for
the coefficients b and c can spoil the whole thing (more than the wrong initial value
for a). If the series is long enough, it might be a good idea to use the first few tens
of points in order to estimate the coefficients of the trend, and then use them as
‘seed values’ in the iterations for the rest of the series.
Example. To illustrate the method, we have applied it to the following data,
shown on Figure 16, representing a concentration of some substance during certain
(long) chemical reaction.
We apply zero order, first order and second order smoothing to it, with the
same α = 0.1. On each of the corresponding graphs (see Figures 17, 18 and 19), we
Figure 16. Concentration data

Figure 17. The data, one step ahead predictions and a five steps ahead prediction made at t = 96, according to the zero order exponential smoothing for the concentration data, α = 0.1.

Figure 18. The data, one step ahead predictions and a five steps ahead prediction made at t = 96, according to the first order exponential smoothing for the concentration data, α = 0.1.

can see the original data, the series made out of one step ahead predictions, and a
five steps ahead prediction made at time instant t = 96.
Role of the smoothing parameter α. The bigger α is, the smaller β is, the more weight is given to recent observations, and the more sensitive the forecast is to the noise. In fact, α = 1 − β is precisely the relative weight given to the very
last observation. The one before gets the weight αβ and so on. So if α is close
Figure 19. The data, one step ahead predictions and a five steps ahead prediction made at t = 96, according to the second order exponential smoothing for the concentration data, α = 0.1.

Figure 20. The data, one step ahead predictions and a five steps ahead prediction made at t = 96, according to the second order exponential smoothing for the concentration data, α = 0.05.

to one, we just ignore everything but the last observation, and our one step ahead
prediction is just equal to the previous value (not a good idea at all). In practice, you should never use values of α that are bigger than 0.2 (otherwise we are giving the last observation too much weight).
On the other hand, if α is very small, the coefficients adjust far too slowly. If, in addition, the degree of the polynomial is not adequate (for instance, the data contains a trend but we are using the zero order exponential smoothing) or if the initial values of the coefficients are chosen badly, the model leads to large prediction errors and the predictions look way off the mark.
To illustrate that, we apply the second order exponential smoothing to the
concentration data (same as on Figure 19) with α = 0.05 (see Figure 20) and
α = 0.15 (Figure 21). As you can see, the bigger α is, the stronger the reaction to any random oscillations.
Choice of α. The following procedure is purely heuristic. Compute the average
of the squared one step ahead prediction error over the last half of the data (or over
the last two-thirds of the data, or over any other fixed portion):
$$Q(\alpha) = \sum_t e_t^2$$

Figure 21. The data, one step ahead predictions and a five steps ahead prediction made at t = 96, according to the second order exponential smoothing for the concentration data, α = 0.15.

Figure 22. First order exponential smoothing for the concentration data, α = 0.092.

Choose the smoothing parameter α that minimizes Q(α). If the optimal α turns
out to be more than 0.2, the method is not working properly (you are giving too
much weight to just a few recent observations). You should either increase the
degree of the polynomial (say, consider first order smoothing instead of zero order
smoothing) and re-evaluate the best α or just look for other prediction methods.
Applying this idea to the concentration data, we get α = 0.092 as the best
value for the first order model (see Figure 22). For the second order model, the
best value is α = 0.072 (see Figure 23).
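A direct way to carry out this search is to evaluate Q(α) on a grid of candidate values and keep the minimizer (minimizing the sum or the average of the squared errors gives the same α). The sketch below (Python) does this for first order smoothing on a synthetic series standing in for the concentration data; the helper names, the grid and the synthetic data are ours.

```python
import numpy as np

def one_step_errors(x, alpha):
    """One step ahead prediction errors for first order exponential smoothing."""
    a, b = x[0], 0.0
    errors = []
    for value in x[1:]:
        e = value - (a + b)
        errors.append(e)
        a, b = a + b + (2*alpha - alpha**2) * e, b + alpha**2 * e
    return np.array(errors)

def Q(x, alpha, portion=0.5):
    """Sum of squared one step ahead errors over the last `portion` of the data."""
    e = one_step_errors(x, alpha)
    return np.sum(e[int(len(e) * (1 - portion)):] ** 2)

# Illustrative data: a slow trend plus noise (a stand-in for the concentration series).
rng = np.random.default_rng(0)
x = 17 + 0.003 * np.arange(200) + 0.2 * rng.standard_normal(200)

grid = np.arange(0.01, 0.21, 0.01)
best = min(grid, key=lambda a: Q(x, a))
print("best alpha on the grid:", round(best, 2))
```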
When does it fail to work? Exponential smoothing may fail to work if the series suddenly changes its behavior, especially if it has sudden jumps. An example can be seen on Figure 24. This data set represents the number of directory assistance calls (monthly, 1962 - 1976). As you can see on the graph, in ten months (May 1973 to March 1974) the number of calls drops by about 80 percent, most likely due to some business decisions. Before and after this period of time, we can see a linear trend, so exponential smoothing of order 1 might work. Indeed, as you can see on Figure 25, exponential smoothing of order 1 with α = 0.1 looks adequate before the drop. However, predictions made after the drop begins are, naturally, absolutely off the mark. It takes more than 36 points (three years) for the prediction to start making sense again. Even the very last prediction made in July 1976, five points before the end of the data set, does not look right (it still predicts a negative trend).
Figure 23. Second order exponential smoothing for the concentration data, α = 0.072.

Figure 24. Directory assistance calls (in UK?), 1962-1976.

Figure 25. Exponential smoothing of order 1 for the directory assistance data, α = 0.1, together with five point predictions made at various time instances (October 1972, February and October 1974, August 1975 and July 1976).

The only reasonable solution here would be to start everything anew just after the
drop.
There exist heuristic algorithms that allow you to detect such a situation automatically. Basically, you can use any past section of the data to estimate the standard deviation of the one step ahead prediction error. If, several times in a row, your prediction errors are all of the same sign (say, all positive) and bigger than, say, four standard deviations, this can be considered as evidence of a jump or other change in behavior.
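One possible (and entirely heuristic) automation of this rule is sketched below in Python; the window length, the four-standard-deviation threshold and the run length of three are illustrative choices, not prescriptions.

```python
import numpy as np

def detect_change(errors, window=50, threshold=4.0, run=3):
    """Flag time indices where `run` consecutive one step ahead errors share a sign
    and all exceed `threshold` standard deviations estimated from a past window."""
    errors = np.asarray(errors, dtype=float)
    flags = []
    for t in range(window + run, len(errors)):
        sigma = errors[t - window - run:t - run].std()
        recent = errors[t - run:t]
        same_sign = np.all(recent > 0) or np.all(recent < 0)
        if sigma > 0 and same_sign and np.all(np.abs(recent) > threshold * sigma):
            flags.append(t)
    return flags

# Toy example: white-noise errors with an artificial jump at t = 150.
rng = np.random.default_rng(1)
e = rng.standard_normal(200)
e[150:] += 8.0
print(detect_change(e)[:3])   # first few flagged indices, shortly after 150
```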
Details. Derivation of (3.6) and (3.8) (sketch). Let α = 1 − β. We begin
with some notation. Let

$$S_0(n) = \sum_{k=0}^{\infty} (1-\alpha)^k X_{n-k}, \qquad S_1(n) = \sum_{k=0}^{\infty} k\,(1-\alpha)^k X_{n-k}$$

We also need to know the sums of the following infinite series:



$$G_0 = \sum_{k=0}^{\infty} (1-\alpha)^k = \frac{1}{\alpha}, \qquad G_1 = \sum_{k=0}^{\infty} k(1-\alpha)^k = \frac{1-\alpha}{\alpha^2}, \qquad G_2 = \sum_{k=0}^{\infty} k^2(1-\alpha)^k = \frac{2-3\alpha+\alpha^2}{\alpha^3}$$

The first one of them is simply a geometric series. The others may be obtained
from the first one by term-by-term differentiation.
We begin with the first order exponential smoothing. The equations (3.5) can
be re-written as
(3.9)
$$S_0(n) - G_0 a + G_1 b = 0, \qquad S_1(n) - G_1 a + G_2 b = 0$$

Let â(n), b̂(n) be the solution to (3.9). We have

(3.10)
$$G_0\hat a(n) - G_1\hat b(n) = S_0(n), \qquad G_1\hat a(n) - G_2\hat b(n) = S_1(n)$$

It is more beneficial to re-write (3.10) in matrix form

(3.11) ΛAn = Sn

where
$$\Lambda = \begin{pmatrix} G_0 & -G_1 \\ G_1 & -G_2 \end{pmatrix}, \qquad A_n = \begin{pmatrix} \hat a(n) \\ \hat b(n) \end{pmatrix}, \qquad S_n = \begin{pmatrix} S_0(n) \\ S_1(n) \end{pmatrix}$$
Now, note that

(3.12)
$$S_0(n) = X_n + (1-\alpha)\sum_{k=1}^{\infty}(1-\alpha)^{k-1}X_{n-k} = X_n + (1-\alpha)\sum_{k=1}^{\infty}(1-\alpha)^{k-1}X_{(n-1)-(k-1)} = X_n + (1-\alpha)S_0(n-1)$$

In a similar way,

(3.13)
$$S_1(n) = (1-\alpha)\sum_{k=1}^{\infty} k(1-\alpha)^{k-1}X_{n-k} = (1-\alpha)\sum_{k=1}^{\infty} \bigl(1+(k-1)\bigr)(1-\alpha)^{k-1}X_{(n-1)-(k-1)} = (1-\alpha)\bigl(S_0(n-1) + S_1(n-1)\bigr)$$
Finally, note that
(3.14) Xn = X̂n (n − 1) + en = â(n − 1) + b̂(n − 1) + en .
Plugging (3.14) into (3.12), we get
(3.15) S0 (n) = (1 − α)S0 (n − 1) + â(n − 1) + b̂(n − 1) + en .
Once again, re-write (3.15) and (3.13) in matrix form. We get
(3.16) Sn = BSn−1 + ΓAn−1 + En
where
$$B = \begin{pmatrix} 1-\alpha & 0 \\ 1-\alpha & 1-\alpha \end{pmatrix}, \qquad \Gamma = \begin{pmatrix} 1 & 1 \\ 0 & 0 \end{pmatrix}, \qquad E_n = \begin{pmatrix} e_n \\ 0 \end{pmatrix}$$
Substituting (3.11) for Sn and Sn−1 in both sides of (3.16) and solving for An , we
get the following relation between An and An−1 :
(3.17) An = Λ−1 (BΛ + Γ)An−1 + Λ−1 En
Now, direct computation shows that
(3.18)
$$\Lambda^{-1} = \begin{pmatrix} 2\alpha-\alpha^2 & -\alpha^2 \\ \alpha^2 & -\alpha^3/(1-\alpha) \end{pmatrix},$$
which implies
(3.19)
$$\Lambda^{-1}(B\Lambda + \Gamma) = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}$$
and
(3.20)
$$\Lambda^{-1}E_n = \begin{pmatrix} 2\alpha-\alpha^2 \\ \alpha^2 \end{pmatrix} e_n,$$
which makes (3.17) equivalent to (3.6).
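If you prefer not to invert Λ by hand, the identities (3.18)–(3.20) are easy to confirm numerically for any particular α; a quick Python/NumPy sketch with α = 0.1 (chosen arbitrarily):

```python
import numpy as np

alpha = 0.1
G0 = 1 / alpha
G1 = (1 - alpha) / alpha**2
G2 = (2 - 3*alpha + alpha**2) / alpha**3

Lam = np.array([[G0, -G1],
                [G1, -G2]])
B = np.array([[1 - alpha, 0],
              [1 - alpha, 1 - alpha]])
Gamma = np.array([[1, 1],
                  [0, 0]])

Lam_inv = np.linalg.inv(Lam)
print(np.round(Lam_inv, 6))                          # compare with (3.18)
print(np.round(Lam_inv @ (B @ Lam + Gamma), 6))      # should be [[1, 1], [0, 1]], as in (3.19)
print(np.round(Lam_inv @ np.array([1.0, 0.0]), 6))   # the column (2*alpha - alpha**2, alpha**2), as in (3.20)
```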
In case of second order exponential smoothing, we need also

$$S_2(n) = \sum_{k=0}^{\infty} k^2(1-\alpha)^k X_{n-k}$$

and

$$G_3 = \sum_{k=0}^{\infty} k^3(1-\alpha)^k = \frac{6-12\alpha+7\alpha^2-\alpha^3}{\alpha^4}, \qquad G_4 = \sum_{k=0}^{\infty} k^4(1-\alpha)^k = \frac{24-60\alpha+50\alpha^2-15\alpha^3+\alpha^4}{\alpha^5}$$

Also, similar to (3.13), we get



(3.21)
$$S_2(n) = (1-\alpha)\sum_{k=1}^{\infty} k^2(1-\alpha)^{k-1}X_{n-k} = (1-\alpha)\sum_{k=1}^{\infty} \bigl(1+2(k-1)+(k-1)^2\bigr)(1-\alpha)^{k-1}X_{(n-1)-(k-1)} = (1-\alpha)\bigl(S_0(n-1) + 2S_1(n-1) + S_2(n-1)\bigr).$$
Finally, (3.14) and (3.15) should be replaced by
(3.22) Xn = X̂n (n − 1) + en = â(n − 1) + b̂(n − 1) + ĉ(n − 1) + en
and
(3.23) S0 (n) = (1 − α)S0 (n − 1) + â(n − 1) + b̂(n − 1) + ĉ(n − 1) + en .
Differentiating Q(a, b, c) given by (3.7), we get the equations
G0 â(n) − G1 b̂(n) + G2 ĉ(n) = S0 (n)
(3.24) G1 â(n) − G2 b̂(n) + G3 ĉ(n) = S1 (n)
G2 â(n) − G3 b̂(n) + G4 ĉ(n) = S2 (n).
To re-write them in the form (3.11), we set
$$\Lambda = \begin{pmatrix} G_0 & -G_1 & G_2 \\ G_1 & -G_2 & G_3 \\ G_2 & -G_3 & G_4 \end{pmatrix}, \quad A_n = \begin{pmatrix} \hat a(n) \\ \hat b(n) \\ \hat c(n) \end{pmatrix}, \quad S_n = \begin{pmatrix} S_0(n) \\ S_1(n) \\ S_2(n) \end{pmatrix}$$
For the equations (3.16) and (3.17) to be valid, we set
$$B = \begin{pmatrix} 1-\alpha & 0 & 0 \\ 1-\alpha & 1-\alpha & 0 \\ 1-\alpha & 2(1-\alpha) & 1-\alpha \end{pmatrix}, \quad \Gamma = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \quad E_n = \begin{pmatrix} e_n \\ 0 \\ 0 \end{pmatrix}$$
Once again, one can verify that
(3.25)
$$\Lambda^{-1} = \begin{pmatrix} \alpha(\alpha^2-3\alpha+3) & \dfrac{3\alpha^2(\alpha-2)}{2} & \dfrac{\alpha^3}{2} \\[6pt] -\dfrac{3\alpha^2(\alpha-2)}{2} & -\dfrac{\alpha^3(9\alpha^2-28\alpha+20)}{4(1-\alpha)^2} & -\dfrac{\alpha^4(3\alpha-4)}{4(1-\alpha)^2} \\[6pt] \dfrac{\alpha^3}{2} & \dfrac{\alpha^4(3\alpha-4)}{4(1-\alpha)^2} & \dfrac{\alpha^5}{4(1-\alpha)^2} \end{pmatrix},$$
which implies
(3.26)
$$\Lambda^{-1}(B\Lambda + \Gamma) = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 1 & 2 \\ 0 & 0 & 1 \end{pmatrix}$$
and
(3.27)
$$\Lambda^{-1}E_n = \begin{pmatrix} \alpha(3-3\alpha+\alpha^2) \\ 3\alpha^2(1-\alpha/2) \\ \alpha^3/2 \end{pmatrix} e_n$$
which is equivalent to (3.8).
45

Exercises
1. Verify (3.12), (3.13), (3.15), (3.16) and (3.17).
2. (a) Verify (3.18), (3.19) and (3.20). (b) Show that (3.17) - (3.20) imply
(3.6).
3. Verify that (3.8) really describes the second order exponential smoothing.
To this end, verify the corresponding modification of (3.11), (3.16) and (3.17), along
with (3.25), (3.26) and (3.27). Finally, show that (3.17) together with (3.26) and
(3.27) implies (3.8).
4. For the data (will be posted on the web), compute the zero order, first order
and second order exponential smoothing with α = 0.05. Graph the results. Which
one looks more reasonable to you?
5. Do the same with α = 0.1.

4. Stationarity Tests
Does the series contain a trend? There exist a number of tests that allow us to find out whether the series possibly contains a trend. We discuss two of them.
Kendall rank correlation test. Let X1 , . . . , Xn be the data. Assume for a
moment that all the values Xi are different. Denote by K the number of pairs i, j
such that i < j and Xi < Xj . The quantity
$$T = \frac{4K}{n(n-1)} - 1$$
is called the Kendall rank correlation. It could be easily seen that −1 ≤ T ≤
1. Also, if the series is monotone increasing, then T = 1, and if it is monotone
decreasing, then T = −1. If the data are independent and identically distributed,
then T is approximately normal with expectation 0 and the variance
$$\sigma_T^2 = \frac{2(2n+5)}{9n(n-1)}.$$
The statistic T has been designed to be sensitive to the presence of a trend, and can be used in order to test stationarity. For instance, if |T| > 1.96σT, then the series possibly contains a trend (the level of significance of the test is about 5%).
If not all of the values Xi are different, we should define K as the number of
pairs i, j such that i < j and Xi < Xj plus one half of the number of pairs i, j
such that i < j and Xi = Xj .
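The definition translates into code almost verbatim; here is a Python sketch (the O(n²) pair count is perfectly adequate for series of the sizes considered in these notes):

```python
import math

def kendall_T(x):
    """Kendall rank correlation T, counting tied pairs with weight 1/2."""
    n = len(x)
    K = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            if x[i] < x[j]:
                K += 1.0
            elif x[i] == x[j]:
                K += 0.5
    T = 4.0 * K / (n * (n - 1)) - 1.0
    sigma_T = math.sqrt(2.0 * (2*n + 5) / (9.0 * n * (n - 1)))
    return T, sigma_T

# A monotone increasing series gives T = 1, as claimed in the text.
print(kendall_T([1, 2, 3, 4, 5]))
```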
Spearman rank correlation test. Let X1 , . . . , Xn be the data. Once again,
assume that all the values Xi are different. For each time instant i, let ri be the
rank of Xi, that is, 1 plus the number of those j ≠ i such that Xj < Xi (so if Xi is
the smallest of X1 , . . . , Xn , then its rank ri is equal to one, if it is the biggest, then
the rank is equal to n). The Spearman rank correlation is defined by the formula
$$S = 1 - \frac{6\sum_{i=1}^{n}(r_i - i)^2}{n(n^2-1)}$$
Once again, if the series is monotone increasing, then S = 1. If it is monotone
decreasing, then S = −1. It could be shown that, if the data are independent and
identically distributed, then S is approximately normal with expectation 0 and the
variance
$$\sigma_S^2 = \frac{1}{n}.$$
Figure 26. White noise (simulated), σ = 14.

If not all of the values Xi are different, we should define ri as 1 plus the number of j ≠ i such that Xj < Xi plus one half of the number of j ≠ i such that
Xj = Xi . In other words, we rearrange the data in the increasing order. If a value
is not repeated, then its rank is exactly its position in the new list. If the value
is repeated several times, then each of them gets the average of the corresponding
ranks. So, for instance, if Xi is the smallest value and it is taken exactly twice,
then each of those observations gets the rank 1.5. If it is repeated six times, then
each of the values gets the rank 3.5.
Both statistics have a similar meaning and work similarly. The power of the corresponding tests is approximately equal.
Example. The following example shows how it works. Suppose the values of
the series are equal to
−4, 21, 14, 10, 22, −11, 14, 18, 14, 21, 19, 20
(n = 12). For the Kendall coefficient, we have here forty pairs such that i < j
and Xi < Xj , and four pairs such that i < j and Xi = Xj (the value 21 is taken
twice and the value 14 is taken three times). That gives us K = 42 and T = 0.273.
However, T/σT = 1.235, which is much less than z0.025 = 1.96, so the Kendall test turns out negative at the 5% level of significance.
For the Spearman coefficient, we have to find the ranks. They are equal
2, 10.5, 5, 3, 12, 1, 5, 7, 5, 10.5, 8, 9
(note that some of the ranks are repeated). That gives us S = 0.33 and S/σS =
1.145, again insignificant at 5 % level of significance.
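Both numbers are easy to reproduce mechanically. The following Python sketch computes the Spearman statistic with mid-ranks for ties and applies it to the twelve values above:

```python
import math

def spearman_S(x):
    """Spearman rank correlation S with mid-ranks for tied values."""
    n = len(x)
    ranks = []
    for xi in x:
        below = sum(1 for xj in x if xj < xi)
        ties = sum(1 for xj in x if xj == xi) - 1   # other observations equal to xi
        ranks.append(1 + below + ties / 2.0)
    S = 1 - 6 * sum((r - i) ** 2 for i, r in enumerate(ranks, start=1)) / (n * (n**2 - 1))
    return S, 1 / math.sqrt(n)

data = [-4, 21, 14, 10, 22, -11, 14, 18, 14, 21, 19, 20]
S, sigma_S = spearman_S(data)
print(round(S, 2), round(S / sigma_S, 3))   # 0.33 and 1.145, as in the example
```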
More Examples and Comments. 1. If you can see a trend on the graph
with your naked eye, don’t even bother to do any testing. The data on Figure 26
represent simulated i.i.d. normal random variables (100 values) with zero mean and
σ = 14.
On Figures 27 and 28, you can see how little trend we need to add to make it statistically significant: for Xt = −0.1t + noise and for Xt = 0.13t + noise, both Kendall and Spearman tests give a positive answer at the 5% level of significance.
2. The tests are not fool-proof. If the trend is not monotone, the tests may
detect it or fail to do so. For instance, both Kendall and Spearman tests are positive
for the data shown on Figure 29 and they are negative for the data shown on Figure
30 though both series contain a trend that goes up and down. Still, both tests turn
Figure 27. Xt = −0.1t + noise. Both Spearman and Kendall tests positive at 5% level of significance.

Figure 28. Xt = 0.13t + noise. Both Spearman and Kendall tests positive at 5% level of significance.

Figure 29. Both tests detect a trend in this data set...

positive if we apply them to an appropriate portion of the second data set, say, to
the first or last third of the series.
3. The tests are sensitive to somewhat different things and may disagree with
each other. Say, according to the Kendall test, there is no trend in the data shown
on Figure 31 and there is a trend in the data shown on Figure 32. According to
the Spearman test, the situation is the opposite. And those data sets are parts of the same time series — the water level in the Dead Sea (Figure 14 in the Introduction).
Figure 30. ... but fail to do so for this data set.

Figure 31. Water level in the Dead Sea, years 1904-1932. Spearman test is positive, Kendall test is negative.

Figure 32. Water level in the Dead Sea, years 1934-1956. Kendall test is positive and Spearman test is negative this time.

4. Both tests are designed to catch the presence of a trend. However, absence of


a trend does not make the series stationary. For instance, its variance or correlation
structure may also change in time. In order to find out if the series is stationary,
we should apply those tests not only to the original data Xt, but also to Xt² (in
order to catch possible changes in the variance) and to cross products like Xt Xt+1 ,
in order to be able to detect changes in the correlation structure. The following
famous series (Figure 33) represents stock market data: the daily price of the IBM stock, way back in the sixties.
Figure 33. IBM stock market data

Figure 34. Increments of the IBM stock market data

Figure 35. Second increments of the IBM stock market data. No signs of the trend. However, the variance, apparently, jumps up somewhere between points 230 and 250.

The series is clearly non-stationary. But, if we take the increments Xt − Xt−1


(Figure 34), and then, the increments of the increments (second increments; Figure
35), they look like a series with zero expectation.
Is this new series stationary? If we examine the graph carefully, we may note
that the variance of the data possibly jumps up somewhere around the point 235.
Indeed, the standard deviation estimated on the interval before t = 235 equals 6.4; after the point 235 we have 13.87. If we square the data (Figure 36) or just
Figure 36. Squares of the second increments of the IBM stock market data. Both Kendall and Spearman tests are positive at 5% level of significance.

Figure 37. Absolute values of the second increments of the IBM stock market data. Again, both Kendall and Spearman tests are positive at 5% level of significance.

take the absolute values (Figure 37), we can try the tests again. Both of the tests
allow us to claim that the series is not stationary.

Exercises
1. Show that if the series is monotone increasing, then S = T = 1.
2. Show that if the series is monotone decreasing, then S = T = −1. Hint:
You may find the following formula useful:
$$1^2 + 2^2 + \cdots + n^2 = \frac{n(n+1)(2n+1)}{6}$$
3. For a given data set (will be posted in the web), find Kendall and Spearman
rank correlation coefficients. Based on them, test the data for stationarity at 5%
level of significance.
Project. For a given noise data εt , t = 1, . . . , 30 (will be posted on the web),
find out how much of the trend is necessary to be detected by the Kendall test. That
is, how big should be a (up to two decimal places) to make the series Xt = at + εt
test positive? Same about Xt = −at + εt . Same question about the Spearman test.
Which of the tests turns out to be more sensitive?
CHAPTER 2

Stationary Models

1. Stochastic Processes and Stationarity


A probabilistic approach to time series analysis is based on the concept of a
stochastic process.
A stochastic process is a family of random variables Xt indexed by time t.
We concentrate on the case when the time is discrete (t = 0, 1, 2, . . . or t =
. . . , −1, 0, 1, . . . or t = t0 , t0 + δ, t0 + 2δ, . . . ). Unless we say otherwise, we shall
assume that the time is integer-valued. From this point of view, a time series is a
realization (a trajectory) of a stochastic process.
We begin with some examples.
Examples.
1. White noise. This is the simplest possible stochastic process. Although not
interesting by itself, it is a building block for other, more important examples. It
is just a sequence of independent identically distributed (i.i.d.) random variables
Xt , t = . . . , −1, 0, 1, . . . . Unless we say otherwise, we assume in addition that
EXt = 0 and Var Xt = σ 2 (so the variance is finite). White noise is called Gaussian
if Xt have a normal distribution N (0, σ 2 ).
2. Trend + noise. Let Xt = a + bt + εt , where εt is a white noise and a, b are
non-random constants. Random variables Xt are still independent, but no longer
identically distributed (if b 6= 0).
3. Sums of independent random variables. Suppose εt is a white noise. Let
(1.1) X0 = 0, Xt = ε1 + ε2 + · · · + εt , t = 1, 2, . . .
Random variables Xt are no longer independent (and the distribution changes in
time).
4. Increments of the trend + noise process. Let Xt be as in the example 2, and
let
(1.2) Yt = Xt − Xt−1 = b + (εt − εt−1 )
where εt is the noise from the example 2. Random variables Yt are identically
distributed but no longer independent. However, Yt and Ys are independent if
|t − s| ≥ 2.
5. Random oscillation. This example (our last one, for the moment) has a
different structure. Let A and B be random variables such that EA = EB =
0, Var A = Var B = σ 2 and Cov(A, B) = 0. Let also ω > 0 be a constant. The
process
(1.3) Xt = A cos(ωt) + B sin(ωt)
is a random oscillation with given frequency ω, its magnitude and phase depend
on A and B. If we know A and B, we know the whole trajectory. A bit different

version of this process is given by the formula


(1.4) Xt = A cos(ωt + φ)
where ω > 0 is again a constant, A is a random variable and φ is another random
variable that is uniformly distributed on the interval [0, 2π] and independent from
A. Once again, if we know A and φ, we know the whole trajectory.
A stochastic process is called Gaussian if a joint distribution of any finite
collection Xt , Xt+1 , . . . , Xt+k is multivariate normal. In our examples 2-4, the
process is Gaussian if the corresponding noise εt is Gaussian. In the example 5,
the process (1.3) is Gaussian if A and B have a bivariate normal distribution. The
process (1.4) is Gaussian if A2 has exponential distribution (see Problem 23 in
Appendix A).
Stationary Processes. With every stochastic process, one can associate its
expectation, or mean value function µ(t) = EXt , its variance σ 2 (t) = Var Xt =
E(Xt − µ(t))2 and its autocovariance function, R(s, t) = Cov(Xs , Xt ) = E(Xs −
µ(s))(Xt − µ(t)).
A process is called weakly, or second order stationary if its expectation and
variance do not change in time, and the autocovariance function depends only on
the difference between the arguments:
µ(t) = µ(t + h), σ 2 (t) = σ 2 (t + h), R(s, t) = R(s + h, t + h)
for any h.
More restrictive is the concept of a strictly, or completely, stationary process. A
stochastic process is completely stationary if a distribution of Xt does not change
in time, a joint distribution of two consecutive values Xt , Xt+1 does not change
in time, a joint distribution of a triplet (Xt , Xt+1 , Xt+2 ) does not change in time,
and so on. If the process is Gaussian, then the concepts of weak and complete
stationarity coincide (because the multivariate Gaussian distribution is completely
determined by the vector of expectations and the covariance matrix).
From this point on, stationarity means second order stationarity.
In case of a stationary process Xt, we write µ or µX for its expectation and σ² or σ²X for its variance. Also, we define its autocovariance function, or ACV, as
R(t) = RX (t) = Cov(Xs , Xs+t )
(this notation does not agree with the notation for the non-stationary processes;
however, we’ll never use the covariance function for non-stationary processes).
Examples revisited.
1. For the white noise, µ(t) = 0, σ²(t) = σ² and Cov(Xs, Xt) = 0 if s ≠ t and σ² if s = t. So Xt is actually a weakly (and even completely) stationary process.
2. Since µ(t) = a + bt changes in time, the process is not stationary. Its variance and autocovariance functions are actually the same as in the example 1.
3. For the sums of independent random variables, µ(t) = EXt = Σ_{i=1}^{t} Eεi = 0.
However,
$$\operatorname{Var} X_t = \operatorname{Var}\Bigl(\sum_{i=1}^{t}\varepsilon_i\Bigr) = \sum_{i=1}^{t}\operatorname{Var}(\varepsilon_i) = t\operatorname{Var}\varepsilon_1$$
grows in time, so this process is not stationary either.
4. Both µ(t) = b and σ²(t) = Var(Yt) = Var(εt − εt−1) = Var(εt) + Var(εt−1) = 2σε² do not change in time. As we noticed before, Cov(Ys, Yt) = 0 if the distance between s and t is greater than 1. Since Cov(Yt, Yt) = Var(Yt) = 2σε² has been already calculated, it remains to find Cov(Yt, Yt+1). We have
$$\operatorname{Cov}(Y_t, Y_{t+1}) = \operatorname{Cov}(\varepsilon_t - \varepsilon_{t-1},\, \varepsilon_{t+1} - \varepsilon_t) = \operatorname{Cov}(\varepsilon_t, \varepsilon_{t+1}) - \operatorname{Cov}(\varepsilon_t, \varepsilon_t) - \operatorname{Cov}(\varepsilon_{t-1}, \varepsilon_{t+1}) + \operatorname{Cov}(\varepsilon_{t-1}, \varepsilon_t) = -\sigma_\varepsilon^2$$
Hence, the process is second order stationary (in fact, it is completely stationary, though we did not show that). For its autocovariance function, R(0) = 2σε² and R(1) = R(−1) = −σε², and all the other values of R are zeros.
5. This is also a second order stationary process. Indeed, suppose the process
is defined by the formula (1.3). Then
EXt = E(A) cos(ωt) + E(B) sin(ωt) = 0
and
Var Xt = Var(A) cos²(ωt) + Var(B) sin²(ωt) + 2 Cov(A, B) cos(ωt) sin(ωt) = σ².
In a similar way, it could be shown (Problem 1) that
(1.5) Cov(Xs , Xs+t ) = σ 2 cos(ωt)
If the process is defined by the equation (1.4), then it is even strictly stationary.
We will only verify second order stationarity though. We could use trig identity
and transform this equation into the (1.3) form, or we could proceed as follows.
First, we notice that
Z 2π
1
E cos(ωt + φ) = cos(ωt + x) dx = 0
2π 0
and therefore
EXt = E(A cos(ωt + φ)) = EA E cos(ωt + φ) = 0.
In a similar way,
$$E\cos^2(\omega t + \phi) = \frac{1}{2\pi}\int_0^{2\pi}\cos^2(\omega t + x)\,dx = \frac{1}{2}$$
and
Var Xt = EXt² = E(A²) E cos²(ωt + φ) = E(A²)/2.
Along the same lines, one can verify that
(1.6)
$$\operatorname{Cov}(X_s, X_{s+t}) = \tfrac{1}{2}E(A^2)\cos(\omega t)$$
Autocorrelation and Partial Autocorrelation. Let Xt be a stationary
process with autocovariance function R(t). Its autocorrelation function or ACF, is
defined by the formula
(1.7)
$$\rho(t) = \operatorname{Corr}(X_s, X_{s+t}) = \frac{\operatorname{Cov}(X_s, X_{s+t})}{\sqrt{\operatorname{Var}(X_s)\operatorname{Var}(X_{s+t})}} = \frac{R(t)}{R(0)}$$
So, autocorrelation function is just the autocovariance function normalized to have
ρ(0) = 1. Autocorrelation function has the following properties.
1. It is even: ρ(t) = ρ(−t).
2. ρ(0) = 1.

3. It is positively semi-definite. That is, if c1 , . . . , cn are real numbers and


t1 , . . . , tn are time instances, then
(1.8)
$$\sum_{k=1}^{n}\sum_{l=1}^{n} c_k c_l\,\rho(t_k - t_l) \ge 0$$
To verify the last property, consider a random variable W = Σ_{k=1}^{n} ck Xtk. Its variance Var W must be non-negative. However,
(1.9)
$$0 \le \operatorname{Var} W = \sum_{k=1}^{n}\sum_{l=1}^{n} c_k c_l \operatorname{Cov}(X_{t_k}, X_{t_l}) = \sum_{k=1}^{n}\sum_{l=1}^{n} c_k c_l\, R(t_k - t_l).$$

Dividing (1.9) by R(0), we get (1.8).


A partial autocorrelation function, or PACF φ(k) is defined for non-negative k
only. Namely, we set φ(0) = 1 and


(1.10)
$$\phi(k) = \frac{\begin{vmatrix} 1 & \rho(1) & \cdots & \rho(k-2) & \rho(1) \\ \rho(1) & 1 & \cdots & \rho(k-3) & \rho(2) \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ \rho(k-2) & \rho(k-3) & \cdots & 1 & \rho(k-1) \\ \rho(k-1) & \rho(k-2) & \cdots & \rho(1) & \rho(k) \end{vmatrix}}{\begin{vmatrix} 1 & \rho(1) & \cdots & \rho(k-2) & \rho(k-1) \\ \rho(1) & 1 & \cdots & \rho(k-3) & \rho(k-2) \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ \rho(k-2) & \rho(k-3) & \cdots & 1 & \rho(1) \\ \rho(k-1) & \rho(k-2) & \cdots & \rho(1) & 1 \end{vmatrix}}$$

for k ≥ 1. As you can see, the determinants differ by the last column only. For
k = 1, we get φ(1) = ρ(1). It could be shown that |φ(k)| ≤ 1 for all k.
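Once the values ρ(1), ρ(2), . . . are available, formula (1.10) can be evaluated numerically. The Python sketch below does so directly from the determinant ratio (not the most efficient route, but a faithful one); the AR(1) example anticipates the result of Section 3 that φ(k) = 0 for k > 1.

```python
import numpy as np

def pacf_from_acf(rho, kmax):
    """Partial autocorrelations phi(1), ..., phi(kmax) via the determinant ratio (1.10).
    `rho` is a function with rho(0) = 1."""
    out = []
    for k in range(1, kmax + 1):
        R = np.array([[rho(abs(i - j)) for j in range(k)] for i in range(k)])  # denominator matrix
        N = R.copy()
        N[:, -1] = [rho(i + 1) for i in range(k)]                              # replace the last column
        out.append(np.linalg.det(N) / np.linalg.det(R))
    return out

# Example: AR(1) with a = 0.9 has rho(k) = 0.9**k, so phi(1) = 0.9 and phi(k) = 0 for k > 1.
print(np.round(pacf_from_acf(lambda k: 0.9**abs(k), 4), 6))
```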
PACF and the best linear predictor. The formula (1.10) looks like some-
thing out of the blue. However, as we’ll see later, it plays an important role in the
theory of autoregressive models.
The following property partly reveals the mystery. Let Xt be a stationary
process with zero expectation. Suppose that we want to approximate a random
variable Xt+k by a linear combination of Xt , Xt+1 , . . . , Xt+k−1 in the sense of least
squares. That is, we want to minimize the expectation
Q(b1 , . . . , bk ) = E(Xt+k − b1 Xt+k−1 − · · · − bk Xt )2
To do so, we should expand the expression, note that E(Xt Xs ) = R(t − s), set
the partial derivatives with respect to bi to zero and solve the corresponding linear
equations (similar to the multiple linear regression). It could be shown (see a sketch
below) that the optimal coefficient bk (the last one) coincides with φ(k).
Yet another interpretation for the partial autocorrelation function is as follows.
Suppose we already found the coefficients b1 , . . . , bk described above. Denote by Z1
the prediction error Z1 = Xt+k −b1 Xt+k−1 −· · ·−bk Xt . Next, we want to find a best
prediction for Xt−1 given the same Xt , Xt+1 , . . . , Xt+k−1 . It could be shown that
the solution is given by the formula b1 Xt +· · ·+bk Xt+k−1 , with the same coefficients
b1 , . . . , bk , taken in the reversed order. Let now Z2 = Xt−1 − b1 Xt − · · · − bk Xt+k−1
be the corresponding prediction error. It turns out that Corr(Z1 , Z2 ) = φ(k + 1)
(and, for that reason, it does not exceed one in absolute value).
Figure 1. For this stationary process, ρ(1) = 0.9.

Figure 2. For this stationary process, ρ(1) = −0.9.

Further Comments and Examples. In fact, the autocorrelation function says a lot about the behavior of the process. Say, if ρ(1) is close to 1, then Xt+1 and Xt are alike (in the sense of probability). It means that Xt slowly moves up and down: if Xt is large, then Xt+1 is also large, and the other way around. On the contrary, if ρ(1) is close to −1, then Xt and Xt+1 are opposite: if one is large, the other one is small. To see that, look at the examples shown on Figures 1 and 2. If a series
contains a cycle, that is if it is quasi-periodic, then its autocorrelation function is
also quasi-periodic.
Details. Best linear predictor (sketch). Let Xt be a stationary process
with zero expectation and let b1 Xt−1 + · · · + bk Xt−k be a predictor for Xt . Then
a square of the prediction error is given by the formula
$$(X_t - b_1X_{t-1} - \cdots - b_kX_{t-k})^2 = X_t^2 + \sum_{i=1}^{k} b_i^2 X_{t-i}^2 - 2\sum_{i=1}^{k} b_i X_t X_{t-i} + 2\sum_{i=1}^{k-1}\sum_{j=i+1}^{k} b_i b_j X_{t-i}X_{t-j}$$

Now, since the expectation of Xt is equal to zero, E(Xt−i )2 = σ 2 and EXt−i Xt−j =
R(j − i). For this reason,
$$Q(b_1,\dots,b_k) = \sigma^2\Bigl(1 + \sum_{i=1}^{k} b_i^2\Bigr) - 2\sum_{i=1}^{k} b_i R(i) + 2\sum_{i=1}^{k-1}\sum_{j=i+1}^{k} b_i b_j R(j-i)$$
Dividing by σ², we get
$$\frac{1}{\sigma^2}Q(b_1,\dots,b_k) = 1 + \sum_{i=1}^{k} b_i^2 - 2\sum_{i=1}^{k} b_i\rho(i) + 2\sum_{i=1}^{k-1}\sum_{j=i+1}^{k} b_i b_j\,\rho(j-i)$$

Differentiating with respect to bi and setting the partial derivatives to zero, we get
the equation
$$b_i - \rho(i) + \sum_{j\ne i} b_j\,\rho(i-j) = 0$$

Since ρ(0) = 1, it is equivalent to


(1.11)
$$\sum_{j=1}^{k} b_j\,\rho(i-j) = \rho(i),$$

i = 1, . . . , k. Solving those linear equations, we get (1.10) for bk .

Exercises
1. Show that the autocovariance function of the process (1.3) in the example
5 is indeed given by the formula (1.5)
2*. By setting t1 = −1, t2 = 0, t3 = 1 in (1.8), show that
ρ(2) ≥ 2ρ(1)2 − 1.
Conclude from here that −1 ≤ φ(2) ≤ 1.
Hint: Let c2 = 1, c1 = a, c3 = b and find a, b that minimize the expression in
(1.8). Compute the minimal value. Still, it must be non-negative. For the second
part, use the definition of the partial autocorrelation function.
3. For the process (1.1), show that R(s, t) = σε2 min(s, t).
4. Find the formula for the ACF of the process (1.2).
5. Find first four values of the PACF (that is, φ(1), φ(2), φ(3), φ(4)) for the
process (1.2).
6. Let Xt and Yt be two second order stationary processes such that Xt and Ys
are independent for all s and t. Show that Zt = Xt + Yt is second order stationary
and its autocovariance function RZ (k) = RX (k) + RY (k) for all k.
7. Let Xt be a second order stationary process with autocovariance function
RX (k), and let Yt = (Xt + Xt−1 )/2. Verify that Yt is a second order stationary
process. Find its autocovariance function in terms of RX (k).
8. Let Xt be a stationary process with zero expectation and autocorrelation
function ρ(k). (a) Show that b = ρ(1) minimizes the expectation
E(Xt − bXt−1 )2
as well as the expectation
E(Xt−2 − bXt−1 )2 .
(b) Verify that
Corr(Xt − ρ(1)Xt−1 , Xt−2 − ρ(1)Xt−1 ) = φ(2).
9*. Let Xt be a stationary process with zero expectation and autocorrelation
function ρ(k). (a) Find the coefficients b1 , b2 that minimize the expectation
E(Xt − b1 Xt−1 − b2 Xt−2 )2
and verify that b2 = φ(2). (b) For the coefficients b1 , b2 from part (a), compute
Corr(Xt − b1 Xt−1 − b2 Xt−2 , Xt−3 − b1 Xt−2 − b2 Xt−1 )
and compare it with φ(3).

2. Moving Average Processes


Moving average process of the 1st order, or MA(1). Suppose εt is a
white noise with variance σε2 , and let µ, b be real numbers. The process
(2.1) Xt = µ + εt + bεt−1
is second order stationary (even strictly stationary), with expectation µ. It is called
a moving average of the order 1. Though Xt is stationary no matter what is the
value of the parameter b, the further analysis such as estimation, prediction and so
on, is very complicated unless we assume that |b| < 1 (this is so called invertibility
condition, see later in Section 7). We have R(0) = Var(Xt ) = (1 + b2 )σε2 and
(2.2) R(1) = R(−1) = Cov(Xt , Xt+1 ) = E(εt + bεt−1 )(εt+1 + bεt ) = bσε2
which implies
(2.3)
$$\rho(1) = \rho(-1) = \frac{b}{1+b^2}$$
All the values R(t) and ρ(t) with t ≠ −1, 0, 1 are equal to zero due to the independence of the εt.
However, the partial autocorrelations have another structure. Since ρ(k) = 0
for all k ≥ 2, (1.10) implies

$$\phi(2) = \frac{\begin{vmatrix} 1 & \rho(1) \\ \rho(1) & 0 \end{vmatrix}}{\begin{vmatrix} 1 & \rho(1) \\ \rho(1) & 1 \end{vmatrix}} = -\frac{\rho(1)^2}{1-\rho(1)^2} = -\frac{b^2}{1+b^2+b^4} \ne 0$$
The next value of the PACF is
$$\phi(3) = \frac{\begin{vmatrix} 1 & \rho(1) & \rho(1) \\ \rho(1) & 1 & 0 \\ 0 & \rho(1) & 0 \end{vmatrix}}{\begin{vmatrix} 1 & \rho(1) & 0 \\ \rho(1) & 1 & \rho(1) \\ 0 & \rho(1) & 1 \end{vmatrix}} = \frac{\rho(1)^3}{1-2\rho(1)^2} = \frac{b^3}{1+b^2+b^4+b^6} \ne 0$$
Moreover, it could be shown (see Problem 8) that
(2.4)
$$\phi(k) = (-1)^{k+1}\,\frac{b^k}{1+b^2+\cdots+b^{2k}}$$
so it is never zero, though it decays at an exponential rate (if |b| ≠ 1).
Moving average process of the order l, or MA(l). It is defined by the
formula
(2.5) Xt = µ + b0 εt + b1 εt−1 + · · · + bl εt−l
where εt is a white noise and b0 , . . . , bl are the coefficients. Usually, b0 = 1. The
process Xt is stationary no matter what are the values of the coefficients b0 , . . . , bl ,
Figure 3. Simulated MA(1) process Xt = εt + 0.9εt−1.

Figure 4. ACF of the process Xt = εt + 0.9εt−1.

but any further analysis requires the so called invertibility condition (Section 7).
The expectation of Xt is equal to µ and its variance is equal to
(2.6)
$$\sigma^2 = (b_0^2 + \cdots + b_l^2)\,\sigma_\varepsilon^2$$
The autocovariance function is given by the formula
(2.7)
$$R(i) = \begin{cases} (b_0 b_i + b_1 b_{i+1} + \cdots + b_{l-i} b_l)\,\sigma_\varepsilon^2 & \text{if } i = 0,\dots,l \\ 0 & \text{if } i > l \end{cases}$$
(and, of course, R(−i) = R(i) defines the values of R for negative arguments).
From (2.6) and (2.7),
(2.8)
$$\rho(i) = \begin{cases} \dfrac{b_0 b_i + b_1 b_{i+1} + \cdots + b_{l-i} b_l}{b_0^2 + b_1^2 + \cdots + b_l^2} & \text{if } i = 0,\dots,l \\ 0 & \text{if } i > l \end{cases}$$
There is no simple formula for the partial autocorrelation other than (1.10). Still, as in the case of the MA(1) process, φ(k) does not vanish outside of any finite interval.
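Formula (2.8) is easy to put into code. The Python sketch below computes the theoretical ACF of an MA process from its coefficients; the first call reproduces (2.3) for b = 0.9, the second gives the ACF of the MA(2) process shown on Figure 9.

```python
def ma_acf(b, max_lag):
    """Theoretical autocorrelations rho(0), ..., rho(max_lag) of an MA process
    X_t = b[0]*eps_t + b[1]*eps_{t-1} + ... + b[l]*eps_{t-l}, using (2.8)."""
    l = len(b) - 1
    denom = sum(bj**2 for bj in b)
    rho = []
    for i in range(max_lag + 1):
        if i > l:
            rho.append(0.0)
        else:
            rho.append(sum(b[j] * b[j + i] for j in range(l - i + 1)) / denom)
    return rho

print(ma_acf([1.0, 0.9], 3))        # [1.0, 0.9/1.81, 0.0, 0.0] -- compare with (2.3)
print(ma_acf([1.0, 1.8, 0.9], 4))   # ACF of the simulated MA(2) process on Figure 9
```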
Examples. On Figures 3 - 10 you can see some simulated MA(1) and MA(2)
processes together with their ACF and PACF.
General Linear Process (or a Moving Average of Infinite Order). Suppose bi, i = 0, 1, 2, . . . is an infinite sequence such that Σ_{i=0}^{∞} bi² < ∞. For a white noise εt, let
$$S_{n,t} = \mu + \sum_{i=0}^{n} b_i\,\varepsilon_{t-i}$$
Figure 5. PACF of the process Xt = εt + 0.9εt−1.

Figure 6. Simulated MA(1) process Xt = εt − 0.9εt−1.

Figure 7. ACF of the process Xt = εt − 0.9εt−1.

Let n < m. Note that Sn,t − Sm,t = −(b_{n+1}ε_{t−n−1} + · · · + b_m ε_{t−m}), and therefore Var(Sn,t − Sm,t) = (b_{n+1}² + · · · + b_m²)σε². So, Var(Sn,t − Sm,t) → 0 as n, m → ∞, which means that the sequence of random variables Sn,t converges in a certain sense (so called convergence in L², or in mean squares, see Appendix A, Section 4). Hence, we can define a stochastic process
$$X_t = \lim_{n\to\infty} S_{n,t} = \mu + \sum_{i=0}^{\infty} b_i\,\varepsilon_{t-i}$$
Figure 8. PACF of the process Xt = εt − 0.9εt−1.

Figure 9. Simulated MA(2) process Xt = εt + 1.8εt−1 + 0.9εt−2.

Figure 10. Simulated MA(2) process Xt = εt − 1.8εt−1 + 0.9εt−2.

In a sense, it is a moving average process of infinite order. Its expectation EXt = µ.


It is called a general linear process. As above, its variance is equal to

(2.9)
$$\sigma^2 = R(0) = \sigma_\varepsilon^2\sum_{i=0}^{\infty} b_i^2$$

and its autocovariance function is given by the formula



(2.10)
$$R(k) = \sigma_\varepsilon^2\sum_{i=0}^{\infty} b_i b_{i+k}, \qquad k > 0.$$
As above, it implies
(2.11)
$$\rho(k) = \frac{\sum_{i=0}^{\infty} b_i b_{i+k}}{\sum_{i=0}^{\infty} b_i^2}, \qquad k > 0.$$
Remark. Moving average processes, together with general linear processes,
have one important property. Since Xt is a linear combination of past values of
the noise εt , or a limit of those, and since εt is a white noise, that is a sequence of
i.i.d. random variables, Xt is independent from future values of the noise εt+k , k > 0.
Moreover, we will see in Section 10 that εt is, in fact, a one step ahead prediction
error.

Exercises
1. Let
Xt = m + εt + aεt−1
and
$$Y_t = m + \eta_t + \frac{1}{a}\,\eta_{t-1}$$
where εt and ηt are white noises. If ση2 = a2 σε2 , then the processes Xt and Yt have
the same mean and variance and the same autocorrelation function.
2. For a process
Xt = εt + εt−1 + 0.6εt−2 ,
find its autocorrelation function, and the first three values φ(1), φ(2), φ(3) of the
partial autocorrelation function.
3. The same question for a process
Xt = εt + 2εt−1 + εt−2 .
4. The same question for a process
Xt = εt − 3εt−1 + 3εt−2 − εt−3 .
5. The same question for a process
$$X_t = \varepsilon_t - \frac{3}{2}\varepsilon_{t-1} + \varepsilon_{t-2} - \frac{1}{4}\varepsilon_{t-3}.$$
6. Find the autocorrelation function of the process described by the equation
$$X_t = \frac{\varepsilon_t + \cdots + \varepsilon_{t-l}}{l+1}$$
7. For a general linear process
Xt = εt + bεt−1 + b2 εt−2 + · · · + bk εt−k + . . . ,
where |b| < 1, find its variance and the autocorrelation function.
8*. Verify (2.4).
Hint: As a first step, show by induction that, for a k × k matrix
$$\begin{pmatrix} 1+b^2 & b & 0 & \cdots & 0 & 0 \\ b & 1+b^2 & b & \cdots & 0 & 0 \\ 0 & b & 1+b^2 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & 1+b^2 & b \\ 0 & 0 & 0 & \cdots & b & 1+b^2 \end{pmatrix},$$
its determinant is equal to 1 + b² + · · · + b^{2k}.

3. Autoregression of the First Order


A stationary process Xt with zero mean is called an autoregression of the first order, or AR(1), if, for some real number a and some white noise εt,
(3.1) Xt − aXt−1 = εt
for all t.
The relation (3.1) alone is not enough; we also have to assume that
(3.2) − 1 < a < 1.
The inequality (3.2) is called the stationarity condition.
A stationary process Yt with expectation µ is autoregression of the first order
if Xt = Yt − µ is an AR(1) with zero expectation. In this case, the equation (3.1)
is equivalent to
(3.3) Yt − aYt−1 = c + εt
for all t, where c = (1 − a)µ.
For the moment, we’ll focus on processes with zero expectation.
Iterating (3.10), we get
(3.4) Xt = εt + aεt−1 + · · · + a^{k−1}εt−k+1 + a^k Xt−k
and the variance of the last term Var(a^k Xt−k) = a^{2k}σ²X → 0 as k → ∞. Passing to the limit as k → ∞, we arrive at the representation

(3.5)
$$X_t = \sum_{i=0}^{\infty} a^i\,\varepsilon_{t-i}$$

In other words, we have represented Xt as a general linear process. According to


(3.5), Xt is a linear combination of past values of the noise, and therefore Xt is
independent from future values of the noise εt+k , k = 1, 2, . . . .
Stationarity and asymptotic stationarity. Equation (3.1) does not guar-
antee that the process is stationary, even together with the stationarity condition.
In fact, everything depends on X0 , it should have correct expectation and variance.
However, suppose that Xt is a stationary process described by (3.1) and let Yt be
another process that also follows the equation (3.1) with the same noise εt . Denote
Wt = Xt − Yt . Subtracting the equation (3.1) for Yt from the same equation for
the process Xt , we get
(3.6) Wt = Xt − Yt = (aXt−1 + εt) − (aYt−1 + εt) = a(Xt−1 − Yt−1) = aWt−1
Iterating (3.6), we have Wt = a^t W0. Therefore
(3.7) |Wt| → 0 as t → ∞ if and only if |a| < 1.
Therefore if the stationarity condition is satisfied, then the process Yt actually
converges to the stationary process Xt as t → ∞, that is, Yt is asymptotically
stationary. Yet another way to argue is as follows. Similar to (3.4), we get
(3.8) Yt = εt + aεt−1 + · · · + a^{t−1}ε1 + a^t Y0.
If |a| < 1, then the last term goes to zero as t → ∞ and, no matter what the value of Y0 is, the process converges to the stationary process Xt = Σ_{i=0}^{∞} a^i εt−i.
Figure 11. Simulated AR(1) process Xt = 0.9Xt−1 + εt.

Figure 12. ACF of the process Xt = 0.9Xt−1 + εt.

Autocorrelation function of AR(1). Let k > 0. From (3.4),


$$\operatorname{Cov}(X_{t-k}, X_t) = \operatorname{Cov}(X_{t-k}, \varepsilon_t) + a\operatorname{Cov}(X_{t-k}, \varepsilon_{t-1}) + \cdots + a^{k-1}\operatorname{Cov}(X_{t-k}, \varepsilon_{t-k+1}) + a^k\operatorname{Cov}(X_{t-k}, X_{t-k}) = a^k\operatorname{Var}X_{t-k} = a^k\sigma^2_X$$
Therefore,
(3.9)
$$R(k) = a^{|k|}\,\sigma^2_X, \qquad \rho(k) = a^{|k|}$$
Partial autocorrelation function of AR(1). Let us look at the determinant
in the numerator of the formula (1.10). To be specific, let us compare the first and
the last column (assuming, of course, that k > 1). According to (3.9), the first
column contains the numbers 1, a, a², . . . , a^{k−1}. On the other hand, the last column contains the numbers a, a², a³, . . . , a^k. Therefore the last column is proportional
to the first one. Hence, the corresponding matrix is degenerate and therefore its
determinant is equal to zero. Therefore, the partial autocorrelation function φ(k) =
0 if k > 1 (and, of course, φ(0) = 1, φ(1) = ρ(1) = a).
Examples. Depending on the sign of a, we have two different types of behavior.
If a is negative, the series oscillates heavily. If a is positive, the series slowly goes
up and down, though not periodically. On Figures 11 - 16 you can see simulated
AR(1) processes with a = 0.9 and a = −0.9, together with their autocorrelation
and partial autocorrelation functions.
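Paths like those on Figures 11 and 14 are easy to simulate. The Python sketch below generates an AR(1) trajectory, discards a burn-in segment so that the effect of the arbitrary starting value dies out (asymptotic stationarity in action), and compares the sample lag-one correlation with the theoretical value ρ(1) = a; the sample size and seed are arbitrary.

```python
import numpy as np

def simulate_ar1(a, n, burn=100, seed=0):
    """Simulate an AR(1) process X_t = a*X_{t-1} + eps_t with standard normal noise."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n + burn)
    x = np.zeros(n + burn)
    for t in range(1, n + burn):
        x[t] = a * x[t - 1] + eps[t]
    return x[burn:]                      # drop the burn-in segment

x = simulate_ar1(0.9, 150)
# The sample lag-1 correlation should be close to the theoretical rho(1) = a = 0.9.
print(round(np.corrcoef(x[:-1], x[1:])[0, 1], 2))
```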
Figure 13. PACF of the process Xt = 0.9Xt−1 + εt.

Figure 14. Simulated AR(1) process Xt = −0.9Xt−1 + εt.

Figure 15. ACF of the process Xt = −0.9Xt−1 + εt.

Remark. Independence of Xt and εt+k is in fact equivalent to the stationarity


condition (3.2). To verify that, let us assume that a stationary process Xt , defined
by (3.1), is such that, for all t and all k > 0, Xt and εt+k are independent. We
have
(3.10) Xt = aXt−1 + εt
and therefore
Var Xt = a2 Var Xt−1 + σε2 + 2a Cov(εt , Xt−1 )
Figure 16. PACF of the process Xt = −0.9Xt−1 + εt.

However, Cov(εt, Xt−1) = 0 because of the independence property, and therefore
$$\sigma^2_X = a^2\sigma^2_X + \sigma^2_\varepsilon$$
which implies
$$\sigma^2_X = \frac{\sigma^2_\varepsilon}{1-a^2}.$$
Since σ²X must be non-negative and σ²ε is positive, this is possible only if |a| < 1, as promised.

Exercises
1. Does the equation
Xt + 0.75Xt−1 = εt
describe a stationary process? If so, what is its expectation? What is the variance?
What is the autocorrelation function? Which of the following data (graphs will be
posted on the web) may possibly be described by this equation?
2. The same questions for the equation
Xt − 0.75Xt−1 = εt
3. Does the equation describe a stationary process? If so, what is its expecta-
tion? What is the variance?
Xt + 0.5Xt−1 = 3 + εt
4. Same questions for the equation
Xt + Xt−1 = 3 + εt
5. Same questions for the equation
Xt − 2Xt−1 = εt

4. Back Shift Operator


Many of the above formulas can be simplified if we use a back shift operator.
The back shift operator B acts on sequences (time series, stochastic processes etc.)
as follows:
BXt = Xt−1 , t = . . . , −1, 0, 1, . . .
Respectively, B 2 Xt = B(BXt ) = BXt−1 = Xt−2 and so on, B k Xt = Xt−k .

For instance,
(1 − 0.2B)Xt = Xt − 0.2BXt = Xt − 0.2Xt−1
The other way around,
(εt + εt−1 )/2 = (εt + Bεt )/2 = [(1 + B)/2]εt = (0.5 + 0.5B)εt
This notation is especially useful if we wish to consider several operations applied
to the series one after one. For instance, suppose we first want to take the incre-
ments (Xt − Xt−1 ), and after that, the averages of the two consecutive increments.
Formally, we should, first, consider Yt = Xt − Xt−1 and then, Zt = (Yt + Yt−1 )/2.
However, this boils down to the formula Zt = 0.5Xt − 0.5Xt−2 . In terms of the
back shift operator, we have
Zt = (0.5 + 0.5B)Yt = (0.5 + 0.5B)(1 − B)Xt
and
(0.5 + 0.5B)(1 − B) = 0.5(1 + B)(1 − B) = 0.5(1 − B 2 )
Operator (1 − B)Xt = Xt − Xt−1 is called the difference operator and has
a special notation ∇Xt (reads ‘nabla’). It is useful when we need to reduce the
non-stationary process to a stationary one. For instance, a process (1.1) can be
characterized by the equations
X0 = 0, Xt = Xt−1 + εt
which can be rewritten as
∇Xt = εt .
An expression ∇²Xt = Xt − 2Xt−1 + Xt−2 is called a second difference, and so on.
For convenience, we set ∇0 = 1, so that ∇0 Xt = Xt .
In a similar way, we define a seasonal difference ∇d Xt = (1−B d )Xt = Xt −Xt−d
which is useful, for instance, when we deal with monthly, etc., data, and expect
certain annual cycles.
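In code, B and its powers are just index shifts. The Python sketch below implements the difference (1 − B^d) for an arbitrary lag d and applies it to the ten values from Exercise 3 at the end of this section; the function name is ours.

```python
import numpy as np

def difference(x, lag=1):
    """(1 - B**lag) x : the ordinary (lag=1) or seasonal difference of a series."""
    x = np.asarray(x, dtype=float)
    return x[lag:] - x[:-lag]

x = np.array([153, 189, 221, 215, 302, 223, 201, 173, 121, 106], dtype=float)
print(difference(x))                 # nabla X_t = X_t - X_{t-1}
print(difference(difference(x)))     # nabla^2 X_t = X_t - 2*X_{t-1} + X_{t-2}
# For monthly data one would use difference(x, lag=12) for the seasonal difference.
```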
In terms of B, the AR(1) model (3.1) can be rewritten as
(4.1) (1 − aB)Xt = εt
and the representation (3.5) becomes
(4.2)
$$X_t = \Bigl(\sum_{i=0}^{\infty} a^i B^i\Bigr)\varepsilon_t,$$
or
$$X_t = (1 + aB + a^2B^2 + \dots)\,\varepsilon_t$$
which agrees with the power series representation
(4.3)
$$(1-az)^{-1} = \frac{1}{1-az} = 1 + az + a^2z^2 + \dots, \qquad |az| < 1,$$
and with the following formal implication of (4.1)
(4.4) Xt = (1 − aB)−1 εt
(which, as written, makes no sense because the right side is not formally defined).
Rule of thumb: in formal calculations related to power series expansions, B should be treated as a number whose absolute value is equal to 1.
Of course, (4.4) and (4.3) do not prove (3.5) or (4.2). However, they lead to a
correct formula which can be eventually justified by other means.

Exercises
1. For each of the following equations, rewrite them in B notation.
Xt − 0.4Xt−1 = εt
Xt = εt + εt−1 − 0.5εt−2
Yt = Xt − Xt−1 + εt
2. Following equations are written in B notation. Rewrite them in ordinary
notation (without the back shift operator).
(1 − B + 0.5B 2 )Xt = εt
(1 − 0.2B)(1 − 0.5B)Xt = εt
(1 − B 12 )Xt = (1 + 0.5B)εt
(1 − B)2 Xt = εt
3. The values X1 , . . . , X10 are known to us and are equal to 153, 189, 221, 215,
302, 223, 201, 173, 121, 106. For which values of t,
Yt = (1 − B)2 Xt
can be computed? What are they?

5. Second Order Autoregression


We say that a stationary process Xt with zero mean is a second order autore-
gression process, or AR(2), if there exists a white noise εt such that
(5.1) Xt + a1 Xt−1 + a2 Xt−2 = εt
for all t. In terms of the back shift operator, we can rewrite (5.1) as
(5.2) (1 + a1 B + a2 B 2 )Xt = εt
or
(5.3) α(B)Xt = εt
where α(x) = 1 + a1x + a2x².
Stationarity condition. As in the case of first order autoregression, we have
to impose a stationarity condition. Namely, we assume that
(5.4)
all roots of the polynomial α(z), real or complex, satisfy the condition |z| > 1,
that is, lie outside of the unit circle. At first sight, this has little in common with the
stationarity condition (3.2) for the first order autoregression. However, we can
rewrite the equation (3.1) in terms of back shift operator as (1 − aB)Xt = εt , or
α(B)Xt = εt with α(z) = 1 − az. The only root of α(z) is equal to 1/a, so it is
greater than one in absolute value if and only if (3.2) holds. Moreover, one can
show (see details below) that, under the stationarity condition (5.4), the process
Xt can be represented as a general linear process. In particular, Xt is independent
from the future values of the noise εt+k , k = 1, 2, . . . .
Processes with non-zero mean. As in case of first order autoregression, a
stationary process Yt with mean µ is AR(2) if Xt = Yt − µ is an AR(2) with zero
mean. The equation (5.1) then implies that
(5.5) Yt + a1 Yt−1 + a2 Yt−2 = c + εt

where c = (1 + a1 + a2 )µ. So µ = EYt could be recovered from the equation (5.5)


as
(5.6)
$$EY_t = \frac{c}{1+a_1+a_2}$$
We’ll focus on processes with zero expectation.
Triangle of Stationarity. How to translate the condition (5.4) into the lan-
guage of coefficients? It is more convenient to switch to the equation
(5.7) z 2 + a1 z + a2 = 0
Denote its roots by z1 and z2 . Since z 2 +a1 z+a2 = z 2 (1+a1 /z+a2 /z 2 ) = z 2 α(1/z),
the roots of (5.7) are reciprocal to the roots of α(z). Therefore (5.4) is equivalent
to the property
(5.8) |z1 | < 1, |z2 | < 1
If a1² ≥ 4a2, then the roots z1, z2 are real and therefore they must belong to the
interval (−1, 1). The quadratic function f (z) = z 2 + a1 z + a2 achieves its minimum
at the point z∗ = −a1 /2. For the roots to be real, the minimal value must be
negative or zero. So, both roots will be within the interval (−1, 1) if and only if the
point of the (negative) minimum is within (−1, 1) but the values at the endpoints
of the segment are strictly positive: |z∗ | < 1 and f (1) > 0, f (−1) > 0. This way we
get
(5.9) a1² ≥ 4a2, |a1| < 2, 1 + a1 + a2 > 0, 1 − a1 + a2 > 0
Next, if a1² < 4a2, then the roots are complex. We have
(5.10)
$$z_{1,2} = \frac{-a_1 \pm i\sqrt{4a_2 - a_1^2}}{2}$$
where i is the imaginary unit. Then
$$|z_1|^2 = |z_2|^2 = \frac{1}{4}\bigl(a_1^2 + (4a_2 - a_1^2)\bigr) = a_2$$
So, we arrive at the following condition:
(5.11) a1² < 4a2, a2 < 1
Combining (5.9) and (5.11) together, we get a triangle
(5.12) a2 < 1, a1 + a2 > −1, a1 − a2 < 1
(see Figure 17).
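Both forms of the condition — the roots of (5.7) inside the unit circle, or the triangle (5.12) — are easy to check mechanically. A Python/NumPy sketch (the second test case is an arbitrary non-stationary pair of coefficients):

```python
import numpy as np

def ar2_is_stationary(a1, a2):
    """Check the stationarity condition for X_t + a1*X_{t-1} + a2*X_{t-2} = eps_t."""
    roots = np.roots([1.0, a1, a2])   # roots of z**2 + a1*z + a2, i.e. z1 and z2 from (5.7)
    by_roots = np.all(np.abs(roots) < 1)
    by_triangle = (a2 < 1) and (a1 + a2 > -1) and (a1 - a2 < 1)
    return by_roots, by_triangle

print(ar2_is_stationary(-0.5, 0.25))    # Example 1 below: stationary by both criteria
print(ar2_is_stationary(-3.0, 2.0))     # roots 1 and 2: not stationary
```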
Stationarity and asymptotic stationarity. Let Xt be a stationary process
described by (5.1) and let Yt be another process that also follows the equation (5.1)
with the same noise εt . Let Wt = Xt − Yt . Similar to (3.6), we get
(5.13) Wt + a1 Wt−1 + a2 Wt−2 = 0.
An equation of the form (5.13) is called a difference equation (see Section C.3). It is known that every solution to the difference equation (5.13) can be written in terms of the roots z1, z2 of the equation (5.7). In particular, if z1 ≠ z2, then (5.13) implies that
(5.14) Wt = C1 z1^t + C2 z2^t
Figure 17. Triangle of Stationarity (in the (a1, a2)-plane; the regions of complex and real roots are marked).

for some appropriate constants C1 , C2 . If z1 = z2 = z∗ , then (5.14) has to be


replaced by
(5.15) Wt = C1 z∗^t + C2 t z∗^t.
However, if the stationarity condition (5.4) is satisfied, then |z1 | < 1, |z2 | < 1 and
therefore (5.14), (5.15) imply |Wt | → 0 as t → ∞. Hence Yt converges to the
stationary process Xt as t → ∞. So, if the stationarity condition (5.4) is satisfied,
then every process Yt that satisfies (5.1), is asymptotically stationary.
Variance and autocovariance function. Note that
Cov(εt , Xt ) = Cov(εt , εt − a1 Xt−1 − a2 Xt−2 )
= Var εt − a1 Cov(εt , Xt−1 ) − a2 Cov(εt , Xt−2 ) = σε2
since Xt is a general linear process (see the details below) and therefore εt is inde-
pendent from the past values of the process. Next,
Var Xt = Cov(Xt , Xt ) = Cov(Xt , εt − a1 Xt−1 − a2 Xt−2 )
= Cov(Xt , εt ) − a1 Cov(Xt , Xt−1 ) − a2 Cov(Xt , Xt−2 )
= σε2 − a1 R(1) − a2 R(2)
where R(k) stands for the autocovariance function. Since R(k) = σ²X ρ(k), we have
(5.16)
$$\sigma^2_X = \frac{\sigma_\varepsilon^2}{1 + a_1\rho(1) + a_2\rho(2)}.$$
Next,
R(1) = Cov(Xt−1 , Xt ) = Cov(Xt−1 , εt − a1 Xt−1 − a2 Xt−2 )
= Cov(Xt−1 , εt ) − a1 Cov(Xt−1 , Xt−1 ) − a2 Cov(Xt−1 , Xt−2 )
= 0 − a1σ²X − a2R(1).
Dividing by the variance σ²X, we get the following equation for the autocorrelation
function ρ:
(5.17) ρ(1) + a1 + a2 ρ(1) = 0
which implies
(5.18)
$$\rho(1) = -\frac{a_1}{1+a_2}.$$
Next, for every k ≥ 2,
R(k) = Cov(Xt−k , Xt ) = Cov(Xt−k , εt − a1 Xt−1 − a2 Xt−2 )
= Cov(Xt−k , εt ) − a1 Cov(Xt−k , Xt−1 ) − a2 Cov(Xt−k , Xt−2 )
= 0 − a1 R(k − 1) − a2 R(k − 2)
which yields
(5.19) ρ(k) + a1 ρ(k − 1) + a2 ρ(k − 2) = 0.
The equation (5.19) is known under the name Yule—Walker equation. In particular,
it implies
(5.20)
$$\rho(2) = -a_1\rho(1) - a_2 = \frac{a_1^2}{1+a_2} - a_2.$$
Substituting (5.18) and (5.20) into (5.16), we get
(5.21)
$$\sigma^2_X = \frac{(1+a_2)\,\sigma_\varepsilon^2}{(1-a_2)(1-a_1+a_2)(1+a_1+a_2)}.$$
Note what happens to the variance as the coefficients approach the boundaries of
the triangle of stationarity.
Since ρ(0) = 1 and ρ(1) is known, (5.19) can be solved (see the details below).
Its solution can be expressed in terms of the roots z1, z2. If z1 ≠ z2, then the solution is given by the formula
(5.22)
$$\rho(k) = \frac{(1-z_2^2)\,z_1^{k+1} - (1-z_1^2)\,z_2^{k+1}}{(z_1-z_2)(1+z_1z_2)}, \qquad k \ge 0.$$
If z1 = z2 = z∗, then
(5.23)
$$\rho(k) = z_*^k\Bigl(1 + k\,\frac{1-z_*^2}{1+z_*^2}\Bigr), \qquad k \ge 0.$$
Note that, even if z1, z2 are not real numbers, they are complex conjugates and (5.22) still results in a real number. Also, formulas (5.22) and (5.23) do not work
for negative k, but the relation ρ(k) = ρ(−k) allows us to find the corresponding
values.
Partial autocorrelation function of AR(2) process. As in the case of the first order autoregression, let us write down the determinant that stands in the numerator of (1.10). It is equal to
$$\begin{vmatrix} 1 & \rho(1) & \cdots & \rho(1) \\ \rho(1) & 1 & \cdots & \rho(2) \\ \rho(2) & \rho(1) & \cdots & \rho(3) \\ \vdots & \vdots & & \vdots \\ \rho(k-1) & \rho(k-2) & \cdots & \rho(k) \end{vmatrix}$$

Assume k > 2. According to (5.17) and (5.19), if we multiply the first column
by −a1 and the second column by −a2 and add them up, then the result will
be precisely equal to the last column. Hence, the matrix is degenerate and its
determinant is equal to zero. Therefore
φ(k) = 0 for all k > 2
Also, φ(1) = ρ(1) = −a1 /(1 + a2 ) and
$$\phi(2) = \frac{\rho(2) - \rho(1)^2}{1 - \rho(1)^2} = -a_2$$
Behavior of the process. If the roots z1 , z2 are real, then the behavior of
the processes, roughly, is similar to the behavior of AR(1) process that corresponds
to the biggest (in the absolute value) root. Interesting effects may occur when the
roots are complex. Let θ be such that
(5.24)
$$\cos\theta = -\frac{a_1}{2\sqrt{a_2}}$$
(note that 4a2 > a1² for the roots to be complex, so a2 is positive and the expression in (5.24) does not exceed one in absolute value). Then (5.10) can be rewritten in the form
(5.25)
$$z_{1,2} = \sqrt{a_2}\,(\cos\theta \pm i\sin\theta) = \sqrt{a_2}\,e^{\pm i\theta}$$
and, with some help of complex exponents and trig identities (see below), (5.22) implies
(5.26)
$$\rho(k) = (\sqrt{a_2})^k\,\frac{\sin(k\theta + \psi)}{\sin\psi}$$
where ψ is such that
$$\tan\psi = \frac{1+a_2}{1-a_2}\tan\theta$$
So, ρ(k) can be represented as a product of a periodic function sin(kθ + ψ) and an exponent (√a2)^k. As a result, the process has a quasiperiodic behavior (a cycle), with period 2π/θ.
Examples. 1. Let us consider the equation
Xt − 0.5Xt−1 + 0.25Xt−2 = 3 + εt .
It describes a stationary process, for instance because its coefficients satisfy the
inequalities (5.12). According to (5.6), expectation of Xt is equal to 4. By (5.18)
and (5.20), its first autocorrelation ρ(1) = 0.4, second autocorrelation ρ(2) = −0.05
and σ²X = (80/63)σ²ε. Next, the corresponding roots z1, z2 of the equation (5.7) are equal to
$$z_{1,2} = \frac{1 \pm i\sqrt{3}}{4} = \frac{1}{2}\,e^{\pm i\pi/3}$$
(since their absolute values are equal to 0.5, we have another confirmation of stationarity). In order to write down a formula for a generic value ρ(k) of the autocorrelation function, we could either use the formula (5.26) (yes, θ = π/3, but tan ψ = 5√3/3 is not inspiring), or solve the equation (5.19) directly. According to Appendix C, Section 3, every solution to (5.19), in particular ρ(k), can be written as
$$\rho(k) = \Bigl(\frac{1}{2}\Bigr)^k\bigl(C_1\cos(k\pi/3) + C_2\sin(k\pi/3)\bigr)$$
where C1, C2 are some constants. Setting k = 0, we get an equation
1 = C1.
For k = 1, we get
0.4 = (1/2)(C1 cos(π/3) + C2 sin(π/3)) = (1/4)(1 + C2√3)
which implies C2 = √3/5. So,
ρ(k) = (1/2)^k (cos(kπ/3) + (√3/5) sin(kπ/3)).
2. Let now
Xt − 0.25Xt−1 − 0.125Xt−2 = εt .
Once again, it is stationary because its coefficients satisfy the inequalities (5.12).
There is no constant in the equation, so the expectation of Xt is equal to zero.
By (5.18) and (5.20), its first autocorrelation ρ(1) = 2/7, second autocorrelation ρ(2) = 11/56 and σX² = (448/405) σε². Next, the corresponding roots z1, z2 of the equation
(5.7) are equal to
z1 = −0.25, z2 = 0.5
Note that absolute value of each of the roots does not exceed 1. In order to write
down a formula for a generic value ρ(k) of the autocorrelation function, we could
either use the formula (5.22) or again solve the equation (5.19). According to
Appendix C, Section 3, every solution to (5.19), in particular ρ(k), could be written
as
ρ(k) = C1 (−1/4)^k + C2 (1/2)^k,  k ≥ 0
where C1 , C2 are some constants. Setting k = 0, we get an equation
1 = C1 + C2 .
For k = 1, we get
2/7 = −C1/4 + C2/2
Solving the equations, we get C1 = 2/7, C2 = 5/7 and therefore
ρ(k) = (2/7)(−1/4)^k + (5/7)(1/2)^k,  k ≥ 0.
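The closed-form expressions obtained in Examples 1 and 2 are easy to check against the Yule—Walker recursion. A small sketch in Python/numpy (the helper name rho_recursive is ours):

    import numpy as np

    def rho_recursive(a1, a2, nlags):
        # ACF from rho(0) = 1, rho(1) = -a1/(1+a2) and the recursion (5.19)
        rho = [1.0, -a1 / (1 + a2)]
        for k in range(2, nlags + 1):
            rho.append(-a1 * rho[k - 1] - a2 * rho[k - 2])
        return np.array(rho)

    k = np.arange(11)
    rho_ex1 = 0.5**k * (np.cos(k * np.pi / 3) + np.sqrt(3) / 5 * np.sin(k * np.pi / 3))
    rho_ex2 = (2 / 7) * (-0.25)**k + (5 / 7) * 0.5**k
    print(np.allclose(rho_ex1, rho_recursive(-0.5, 0.25, 10)))     # True
    print(np.allclose(rho_ex2, rho_recursive(-0.25, -0.125, 10)))  # True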
More Examples. On Figures 18 - 23 you can see how the behavior of an
AR(2) process depends on the roots z1 , z2 of (5.7).
Details. 1. Derivation of (5.22) and (5.23) (sketch). Note that (5.19)
is a difference equation of the order 2, as described in Appendix C.3. Therefore
any sequence of values ρ(k) that satisfies (5.19), can be represented in terms of the
roots z1 , z2 of the equation (5.7). If the roots z1 , z2 are distinct, then
ρ(k) = C1 z1k + C2 z2k
where C1 , C2 are some constants. If the roots are equal, so that z1 = z2 = z∗ , then
ρ(k) = C1 z∗k + C2 kz∗k .
Those equations hold for all k ≥ 0. However, the first values of autocorrelation
function are already known to us (see (5.18)). Plugging in ρ(0) = 1 and ρ(1) =
Figure 18. Simulated AR(2) process Xt = Xt−1 − 0.2Xt−2 + εt. Both roots are positive, the biggest is 0.723.
Figure 19. Simulated AR(2) process Xt = −Xt−1 − 0.2Xt−2 + εt. Both roots are negative, the biggest in absolute value is −0.723.
Figure 20. Simulated AR(2) process Xt = 0.5Xt−1 + 0.4Xt−2 + εt. Real roots of opposite signs (0.93 and −0.43).

−a1/(1 + a2), we get two linear equations for C1 and C2 that can be easily solved. Namely, we get
C1 = (1/(z2 − z1))(z2 + a1/(1 + a2)),  C2 = 1 − C1 = (1/(z1 − z2))(z1 + a1/(1 + a2))
in case of the distinct roots, and
C1 = 1,  C2 = −1 − a1/(z∗(1 + a2))
Figure 21. Simulated AR(2) process Xt = 1.8Xt−1 − 0.9Xt−2 + εt. Complex roots, z1,2 = 0.9 ± 0.3i. Period of the corresponding cycle is approximately 20.

Figure 22. Autocorrelation function for the AR(2) process Xt = 1.8Xt−1 − 0.9Xt−2 + εt.
Figure 23. Partial Autocorrelation function for the AR(2) process Xt = 1.8Xt−1 − 0.9Xt−2 + εt.
in case of equal roots. Next, since


z 2 + a1 z + a2 = (z − z1 )(z − z2 ),
the coefficients a1 and a2 can also be expressed in terms of z1 and z2 . Namely,
a1 = −(z1 + z2 ), a2 = z1 z2
in case of distinct roots, or


a1 = −2z∗ , a2 = z∗2
in case of equal roots. For this reason, the above formulas for C1 and C2 can be
reduced to
C1 = z1(1 − z2²)/[(z1 − z2)(1 + z1 z2)],  C2 = z2(1 − z1²)/[(z2 − z1)(1 + z1 z2)]
in case of distinct roots, and to
C1 = 1,  C2 = (1 − z∗²)/(1 + z∗²)
in case of equal roots, which is equivalent to (5.22) and (5.23).
2. Derivation of (5.26) (sketch). Suppose the roots z1 , z2 are not real
numbers. Then they are complex conjugates, and
z1 z2 = |z1 |2 = a2 .
Hence, for some θ,
z1,2 = √a2 e^{±iθ} = √a2 (cos θ ± i sin θ)
and
z1 + z2 = 2√a2 cos θ,  z1 − z2 = 2i√a2 sin θ
On the other hand, z1 + z2 = −a1 , which implies (5.24). Next,
z1^l = (a2)^{l/2} (cos(lθ) + i sin(lθ))
and
z2^l = (a2)^{l/2} (cos(lθ) − i sin(lθ))
For that reason,
z1^{k+1} − z2^{k+1} = 2i (a2)^{(k+1)/2} sin((k + 1)θ)
and
z1² z2² (z1^{k−1} − z2^{k−1}) = 2i a2² (a2)^{(k−1)/2} sin((k − 1)θ)

Substituting those expressions into (5.22) and canceling 2i√a2, we get
(5.27) ρ(k) = (a2)^{k/2} [sin((k + 1)θ) − a2 sin((k − 1)θ)] / [(1 + a2) sin θ]
Next,
sin((k ± 1)θ) = sin(kθ) cos(θ) ± cos(kθ) sin(θ)
Substituting those expressions into (5.27), we get
ρ(k) = (a2)^{k/2} [(1 − a2) sin(kθ) cos θ + (1 + a2) cos(kθ) sin θ] / [(1 + a2) sin θ]
Dividing both numerator and denominator by (1 − a2 ) cos θ and using the definition
of ψ, we get
ρ(k) = (a2)^{k/2} [sin(kθ) + ((1 + a2)/(1 − a2)) cos(kθ) tan θ] / [((1 + a2)/(1 − a2)) tan θ]
= (a2)^{k/2} [sin(kθ) + cos(kθ) tan ψ] / tan ψ
= (a2)^{k/2} [sin(kθ) cos ψ + cos(kθ) sin ψ] / sin ψ = (a2)^{k/2} sin(kθ + ψ)/sin ψ
as promised.
3. Stationarity condition and General Linear Process (sketch). Sup-
pose that the stationary process Xt satisfies the equation (5.3) and the polynomial
α(z) satisfies the stationarity condition (5.4). We want to show that the process
Xt can be represented as a general linear process.
It is more convenient to switch to the equation (5.7). Its roots z1 , z2 should
satisfy the condition (5.8). Suppose, first, that z1 , z2 are distinct and real. The
equation (5.7) implies that z1 z2 = a2 and z1 + z2 = −a1 . Therefore
α(x) = 1 + a1 x + a2 x2 = (1 − z1 x)(1 − z2 x)
and
1/α(x) = 1/(1 + a1x + a2x²) = 1/[(1 − z1x)(1 − z2x)] = (1/(z1 − z2)) [z1/(1 − z1x) − z2/(1 − z2x)]
(partial fractions). This way, we get the following power series representation:

(5.28) 1/α(x) = (1/(z1 − z2)) Σ_{k=0}^{∞} (z1^{k+1} − z2^{k+1}) x^k

and the series converges if |z1 x| < 1 and |z2 x| < 1. By substituting the shift
operator B instead of x, we get a representation

(5.29) Xt = (1/α(B)) εt = (1/(z1 − z2)) Σ_{k=0}^{∞} (z1^{k+1} − z2^{k+1}) B^k εt = (1/(z1 − z2)) Σ_{k=0}^{∞} (z1^{k+1} − z2^{k+1}) εt−k

Because of (5.8), the series on the right converges in mean squares and defines a
general linear process that satisfies (5.1).
If the roots are complex, the above arguments still hold but the expression on
the right side of (5.29) contains some complex numbers and has to be adjusted.
Indeed, then z1 and z2 are complex conjugates and therefore admit a representation
z1,2 = √a2 e^{±iθ} = √a2 (cos θ ± i sin θ)
for some θ. Then z1 − z2 = 2i√a2 sin θ and z1^{k+1} − z2^{k+1} = 2i(√a2)^{k+1} sin((k + 1)θ),
and the formula (5.29) boils down to

(5.30) Xt = (1/α(B)) εt = Σ_{k=0}^{∞} (√a2)^k [sin((k + 1)θ)/sin θ] εt−k
Finally, suppose now that the roots are equal: z1 = z2 = z∗ . Then
α(x) = 1 + a1x + a2x² = (1 − z∗x)²,  1/α(x) = 1/(1 − z∗x)²
and

(5.31) 1/α(x) = Σ_{k=0}^{∞} (k + 1) z∗^k x^k
This leads to the following power series representation:
(5.32) Xt = (1/α(B)) εt = Σ_{k=0}^{∞} (k + 1) z∗^k B^k εt = Σ_{k=0}^{∞} (k + 1) z∗^k εt−k
Again, this series converges in mean squares if |z∗ | < 1 and defines a general linear
process that satisfies (5.1).
The converse is also true: if a stationary process Xt satisfies (5.1) and, for every k > 0, Xt is independent from εt+k, then the condition (5.8) must be valid.
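The coefficients of the general linear process above can also be generated without computing the roots: comparing coefficients in α(x)(ψ0 + ψ1x + ψ2x² + . . . ) = 1 gives ψ0 = 1, ψ1 = −a1 and ψk = −a1ψ_{k−1} − a2ψ_{k−2} for k ≥ 2. A small numerical sketch in Python/numpy (the name psi_weights is ours) comparing this recursion with the closed form in (5.29):

    import numpy as np

    def psi_weights(a1, a2, n):
        # coefficients of 1/alpha(x) = sum psi_k x^k, alpha(x) = 1 + a1 x + a2 x^2
        psi = [1.0, -a1]
        for k in range(2, n + 1):
            psi.append(-a1 * psi[k - 1] - a2 * psi[k - 2])
        return np.array(psi)

    a1, a2 = -1.4, 0.48              # e.g. X_t - 1.4 X_{t-1} + 0.48 X_{t-2} = eps_t (cf. Problem 8)
    z1, z2 = np.roots([1, a1, a2])   # roots of z^2 + a1 z + a2, here 0.8 and 0.6
    k = np.arange(11)
    closed_form = (z1**(k + 1) - z2**(k + 1)) / (z1 - z2)     # coefficients in (5.29)
    print(np.allclose(psi_weights(a1, a2, 10), closed_form))  # True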

Exercises
Hint to Problems 1-5: You can just follow the worked examples earlier in this section (you may also find Section C.3 useful). Equally, you could verify that the values ρ(0), ρ(1) are the correct ones, and that the Yule—Walker equations (5.19) are satisfied for all k ≥ 2. Also, you could use (5.22), (5.23) or (5.26).
1. Verify that the equation
Xt + (2/15)Xt−1 − (1/15)Xt−2 = εt
describes a stationary series. Show that its ACF is given by the formula
ρ(k) = (9/14)(−1)^k (1/3)^k + (5/14)(1/5)^k,  k ≥ 0.
Find its PACF.
2. Verify that the equation
Xt − (7/6)Xt−1 + (1/3)Xt−2 = εt
describes a stationary series. Show that its ACF is given by the formula
ρ(k) = −(5/4)(1/2)^k + (9/4)(2/3)^k,  k ≥ 0.
Find its PACF.
3. Verify that the equation
Xt − (2/3)Xt−1 + (4/9)Xt−2 = εt
describes a stationary series. Show that its ACF is given by the formula
ρ(k) = (2/3)^k (cos(πk/3) + (5/(13√3)) sin(πk/3)),  k ≥ 0.
Find its PACF.
4. Verify that the equation
Xt − Xt−1 + 0.5Xt−2 = εt
describes a stationary series. Show that its ACF is given by the formula
ρ(k) = (1/√2)^k (cos(πk/4) + (1/3) sin(πk/4)),  k ≥ 0.
Find its PACF.
5. Verify that the equation
Xt − Xt−1 + 0.25Xt−2 = εt
describes a stationary series. Show that its ACF is given by the formula
ρ(k) = (1 + (3/5)k)(1/2)^k,  k ≥ 0.
Find its PACF.


6. Does the equation
Xt − (5/12)Xt−1 − (1/4)Xt−2 = 2 + εt
describe a stationary series? If yes, what is its expectation and variance? Find a
formula for its ACF and PACF.
7. For which values of a does the equation
Xt − aXt−1 + 0.5Xt−2 = εt
describe a stationary series? The same question for the equations
Xt − Xt−1 + aXt−2 = εt
and
Xt + aXt−1 + Xt−2 = εt
8*. (a) Verify that the process
Xt − 1.4Xt−1 + 0.48Xt−2 = εt
satisfies the stationarity condition. (b) Assuming Xt is stationary, represent process
Xt as a general linear process.
9*. Do the same for the process
Xt − Xt−1 + 0.5Xt−2 = εt
10*. Do the same for the process
Xt − 1.5Xt−1 + 0.525Xt−2 = εt

6. General Autoregression Model


We say that a stationary process Xt with zero expectation is an autoregression
process of the order k, or AR(k), if it satisfies the equation
(6.1) Xt + a1 Xt−1 + · · · + ak Xt−k = εt
where εt stands for the white noise. In terms of the shift operator, the equation
(6.1) can be rewritten as
(6.2) α(B)Xt = (1 + a1 B + · · · + ak B k )Xt = εt
where
α(x) = 1 + a1 x + · · · + ak xk .
Stationarity condition. Once again, we have to assume that all roots of the
polynomial α(z), real or complex, are outside of the unit circle, that is, satisfy the
condition |z| > 1. We refer to this property as the stationarity condition. As in
case of second order autoregression, this condition can be reformulated as follows:
all the roots, both real and complex, of the polynomial
(6.3) z k + a1 z k−1 + · · · + ak
satisfy the condition |z| < 1, that is, belong to the unit disk. If the stationarity condition is satisfied, then the process Xt can be represented as a general linear
process (see a sketch below). In particular, the process Xt is independent from the
future values of the noise εt+k , k > 0.
Asymptotic stationarity. As in case of AR(1) and AR(2) models, station-
arity condition implies that every process Yt that follows the equation (6.1), is
asymptotically stationary, that is, it converges to some stationary process Xt de-


scribed by the same equation.
Processes with non-zero expectation. As before, a stationary process Yt
with expectation µ is an AR(k) if Xt = Yt − µ is an AR(k) with zero expectation.
In terms of Yt , the equation (6.1) can be rewritten as
(6.4) Yt + a1 Yt−1 + · · · + ak Yt−k = c + εt
where c = (1 + a1 + · · · + ak )µ. Hence, µ = c/(1 + a1 + · · · + ak ).
Yule—Walker equations. Similar to AR(2), we can show that
(6.5) σX² = σε² / (1 + a1 ρ(1) + · · · + ak ρ(k))
and
(6.6) ρ(t) + a1 ρ(t − 1) + · · · + ak ρ(t − k) = 0
for all t ≥ 1. The equation (6.6) is called the Yule—Walker equation. Since
ρ(0) = 1 and ρ(t) = ρ(−t), the Yule—Walker equations with t = 1, . . . , k − 1 allow
us to find all the first values ρ(1), . . . , ρ(k − 1). For t ≥ k, the equation (6.6) turns
out to be a difference equation (see Appendix C, section 3) and therefore ρ(t) must
be a linear combination of exponents and damped sine functions.
The partial autocorrelation function of AR(k) process, has the property
(6.7) φ(l) = 0 if l ≥ k + 1.
Example. Consider the equation
Xt − Xt−1 + 0.5Xt−2 − 0.125Xt−3 = εt .

The corresponding polynomial α(z) = 1 − z + 0.5z² − 0.125z³ has roots 2 and 1 ± i√3.
Since absolute values of all of the roots are equal to 2, the process is stationary. It
has an expectation zero (since no constant is involved). In order to find the first two
values of the autocorrelation function, we have to use the Yule—Walker equations.
Setting t = 1, we get the equation
ρ(1) − ρ(0) + 0.5ρ(−1) − 0.125ρ(−2) = 0.
Since ρ(0) = 1 and ρ(−1) = ρ(1), ρ(−2) = ρ(2), it is equivalent to
1.5ρ(1) − 0.125ρ(2) = 1
Next, setting t = 2, we get
ρ(2) − ρ(1) + 0.5ρ(0) − 0.125ρ(−1) = 0
which is equivalent to
−1.125ρ(1) + ρ(2) = −0.5
Solving for ρ(1) and ρ(2), we get
ρ(1) = 20/29,  ρ(2) = 8/29.
We can now find
ρ(3) = ρ(2) − 0.5ρ(1) + 0.125ρ(0) = 13/232
and so on. Using those values and the formula (1.10), we can now find first three
values of the partial autocorrelation function (this is an AR(3) model, so φ(k) = 0
for k ≥ 4). Namely,
φ(1) = ρ(1) = 20/29,  φ(2) = (ρ(2) − ρ(1)²)/(1 − ρ(1)²) = −8/21
and
φ(3) = [ρ(3) + ρ(1)ρ(2)² + ρ(1)³ − 2ρ(1)ρ(2) − ρ(1)²ρ(3)] / [1 + 2ρ(1)²ρ(2) − ρ(2)² − 2ρ(1)²] = 1/8
In order to find a closed form expression for ρ(k), we have to solve the Yule-Walker
equation
(6.8) ρ(k) − ρ(k − 1) + 0.5ρ(k − 2) − 0.125ρ(k − 3) = 0, k≥3
To this end, we have to find the roots of the auxiliary equation
z³ − z² + 0.5z − 0.125 = 0
They are equal to z1 = 1/2 and z2,3 = 1/4 ± i√3/4; all three have absolute value 1/2. According to Appendix C.3, the general solution to the equation (6.8) is given by
the formula
ρ(k) = C1 (1/2)^k + (1/2)^k (C2 cos(kπ/3) + C3 sin(kπ/3))
(the first term is related to z1 and the rest is related to z2,3 ). Plugging in k = 0, 1, 2,
we get three equations
1 = C1 + C2   (k = 0)
20/29 = (1/2)(C1 + (1/2)C2 + (√3/2)C3)   (k = 1)
8/29 = (1/4)(C1 − (1/2)C2 + (√3/2)C3)   (k = 2)
Solving the equations, we get

C1 = 21/29,  C2 = 8/29,  C3 = 10√3/29
and therefore
ρ(k) = (1/2)^k (21/29 + (8/29) cos(kπ/3) + (10√3/29) sin(kπ/3))
for k ≥ 3.
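The computations of this example are easy to automate. A minimal sketch in Python/numpy, written specifically for this AR(3) (the variable names are ours):

    import numpy as np

    # AR(3) example: X_t - X_{t-1} + 0.5 X_{t-2} - 0.125 X_{t-3} = eps_t
    a1, a2, a3 = -1.0, 0.5, -0.125

    # Yule--Walker equations (6.6) for t = 1, 2, using rho(0) = 1 and rho(-t) = rho(t):
    #   (1 + a2) rho(1) + a3 rho(2) = -a1
    #   (a1 + a3) rho(1) +  rho(2)  = -a2
    A = np.array([[1 + a2, a3],
                  [a1 + a3, 1.0]])
    rho1, rho2 = np.linalg.solve(A, np.array([-a1, -a2]))
    print(rho1, rho2)                  # 20/29 ~ 0.6897 and 8/29 ~ 0.2759

    # further values from the recursion (6.6) with t >= 3
    rho = [1.0, rho1, rho2]
    for t in range(3, 11):
        rho.append(-(a1 * rho[t - 1] + a2 * rho[t - 2] + a3 * rho[t - 3]))
    print(rho[3])                      # 13/232 ~ 0.0560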
Details. 1. Stationarity Condition and General Linear Processes. Let
the roots of α(z) satisfy the condition |z| > 1. It could be shown that 1/α(z) can be represented as a sum of its Maclaurin series
1/α(z) = 1 + C1 z + C2 z² + . . .
with the radius of convergence that is greater than 1. It follows either from the
theory of analytic functions, or from the same trick that we have used with AR(2):
we factorize
α(x) = (1 − z1 x) . . . (1 − zk x)
where z1 , . . . , zk are the roots of the equation (6.3). After that, we apply the partial
fractions and the formula for the sum of a geometric series.
In any case, it means that the series Σ Ci converges absolutely, and therefore
(6.9) Xt = (1/α(B)) εt = (1 + C1 B + C2 B² + . . . ) εt
is a well-defined general linear process. It could be verified that Xt satisfies (6.1).
For this reason, Xt is independent from the future values of the noise εt+k , k > 0.
Another way to explain the role of the stationarity condition is to consider the
equation (6.1) without noise:
(6.10) Yt + a1 Yt−1 + · · · + ak Yt−k = 0
The equation (6.10) is a difference equation of the order k, and its general solution
can be written (Appendix C, section 3) in terms of the roots z1 , . . . , zk of (6.3) . If
their absolute values |z1 | < 1, . . . |zk | < 1, then Yt → 0 as t → ∞ no matter what
are the initial values Y0 , . . . , Yk−1 . It could be shown that, with added noise, the
process Yt converges to a stationary process determined by the right side of the
equation (6.9) no matter what are Y0 , . . . , Yk−1 , and therefore Yt is asymptotically
stationary.
2. Derivation of (6.5) and (6.6). We begin with the formula
Xt = εt − a1 Xt−1 − · · · − ak Xt−k.
Since εt is independent from past values of the process,
Cov(Xt , εt ) = σε2
and
σX² = Cov(Xt, Xt) = Cov(Xt, εt) − Σ_{j=1}^{k} a_j Cov(Xt, Xt−j) = σε² − Σ_{j=1}^{k} a_j R(j)
Next, R(j) = σX² ρ(j) and therefore
σX² (1 + a1 ρ(1) + · · · + ak ρ(k)) = σε²
which implies (6.5). Next, let i > 0. Then
R(i) = Cov(Xt−i, Xt) = Cov(Xt−i, εt) − Σ_{j=1}^{k} a_j Cov(Xt−i, Xt−j) = − Σ_{j=1}^{k} a_j R(i − j).
Dividing by σX², we get (6.6).
3. Derivation of (6.7) (sketch). Let l > k. According to (1.10), φ(l)
is a quotient of two determinants. Like in AR(2) case, Yule—Walker equation
(6.6) implies that the determinant in the numerator is equal to zero, because if
we multiply the first k columns by the weights a1 , . . . , ak , and subtract the last
column, we get zero.

Exercises
1. Does the equation
Xt + aXt−1 − Xt−2 − aXt−3 = εt
describe a stationary series for any a?
2. (a) Verify that the equation
(1 − 0.5B)(1 − B + 0.5B 2 )Xt = −1 + εt
describes a stationary process. What is its expectation and variance? (b) Write the
Yule—Walker equations (6.6) for t = 1, 2, 3 and solve them to find ρ(1), ρ(2), ρ(3).
3*. For the same equation, find a formula for ρ(k).
4*. Does the equation
(1 + B + 0.5B 2 )(1 + 0.5B + 0.25B 2 )Xt = εt
describe a stationary process? If yes, what is its expectation and variance? Write
the Yule—Walker equations (6.6) for t = 1, 2, 3, 4 and solve them to find ρ(1), ρ(2),
ρ(3) and ρ(4).

7. Moving Average Processes Revisited. Invertibility Condition


Let Xt be given by the formula,
(7.1) Xt = εt + b1 εt−1 + · · · + bl εt−l
where εt is the white noise. As we have seen in Section 2, Xt is a stationary process
with zero expectation. Denote
β(x) = 1 + b1 x + · · · + bl xl
In terms of the back shift operator, the equation (7.1) can be rewritten as
(7.2) Xt = β(B)εt
Consider now a reciprocal function 1/β(z). It can be expanded into a power
series (Maclaurin series),
(7.3) 1/β(z) = 1 + D1 z + D2 z² + . . .
Equivalently, we can write
(7.4) 1 = (1 + D1 z + D2 z 2 + . . . )β(z)
We say that the process Xt satisfies the invertibility condition if all the roots
of the polynomial β(z) are outside of the unit circle, that is, satisfy the condition
|z| > 1. Under this condition, the radius of convergence of the series (7.3) is greater
than 1. To see that, we can either refer to the theory of analytic functions, or
factorize β(z), do partial fractions in order to represent 1/β(z) as a sum of simple
fractions, and then figure out a power series representation for each of them, as we
did in case of AR(2) and AR(k) processes.
Applying the representation (7.4) to both sides of (7.2), we get a representation
(7.5) εt = (1+D1 B +D2 B 2 +. . . )β(B)εt = (1+D1 B +. . . )Xt = Xt +D1 Xt−1 +. . .
The convergence of the series on the right could be verified though it is a delicate
question, because Xt are not independent. So, Xt could be viewed as an autore-
gression process of infinite order.
Example. To illustrate how it works, let us find a representation (7.5) for a
model
Xt = εt − 0.1εt−1 − 0.2εt−2 .
We have
β(z) = 1 − 0.1z − 0.2z 2
The equation β(z) = 0 has the roots z = 2 and z = −2.5 so the model is invertible.
Moreover,
β(z) = (1 − 0.5z)(1 + 0.4z)
and therefore
1/β(z) = 1/[(1 − 0.5z)(1 + 0.4z)] = (5/9)·1/(1 − 0.5z) + (4/9)·1/(1 + 0.4z)

(partial fractions). Now,
1/(1 − 0.5z) = 1 + 0.5z + (0.5)²z² + · · · = Σ_{k=0}^{∞} (0.5)^k z^k
and
1/(1 + 0.4z) = 1 − 0.4z + (0.4)²z² − · · · = Σ_{k=0}^{∞} (−1)^k (0.4)^k z^k

Multiplying those series by 5/9 and 4/9 and adding them up, we get the following representation
1/β(z) = Σ_{k=0}^{∞} [5(0.5)^k + 4(−1)^k(0.4)^k]/9 · z^k

So, the equation (7.5) takes the form
εt = Σ_{k=0}^{∞} [5(0.5)^k + 4(−1)^k(0.4)^k]/9 · Xt−k = Xt + 0.1Xt−1 + 0.21Xt−2 + . . .
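The coefficients Dk can also be produced without partial fractions: comparing coefficients in (7.4) gives D0 = 1 and Dk = −(b1 D_{k−1} + · · · + bl D_{k−l}) for k ≥ 1 (with Dj = 0 for j < 0). A small sketch in Python (the name inverse_coefficients is ours):

    import numpy as np

    def inverse_coefficients(b, n):
        # D_0, ..., D_n for 1/beta(z) with beta(z) = 1 + b[0] z + ... + b[l-1] z^l
        l = len(b)
        D = [1.0]
        for k in range(1, n + 1):
            D.append(-sum(b[j] * D[k - 1 - j] for j in range(min(l, k))))
        return np.array(D)

    # the example above: beta(z) = 1 - 0.1 z - 0.2 z^2
    print(inverse_coefficients([-0.1, -0.2], 5))   # [1, 0.1, 0.21, 0.041, ...]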

Comparison of autoregressive and moving average models. Suppose


Xt is a first order autoregression, described by the equation (3.10). Then ρ(1) = a.
Since a could be any number between −1 and 1, there are no restrictions on ρ(1) other than |ρ(1)| < 1. The same is true for the general autoregression process AR(k) (there are no restrictions on the first k values of the autocorrelation function other than positive semidefiniteness (1.8)). However, if Xt is MA(1), then

ρ(1) = b/(1 + b²)
and therefore |ρ(1)| ≤ 0.5 (moreover, ρ(1) = ±0.5 corresponds to b = ±1, and
the corresponding process does not satisfy the invertibility condition). The same
is true for higher order models: for MA(2), the biggest possible value for ρ(1) is √2/2 ≈ 0.707 (Problem 5); for MA(3), the biggest possible value for ρ(1) is (1 + √5)/4 ≈ 0.809 (Problem 6); for MA(4), the biggest possible value is √3/2 ≈ 0.866. It could be shown that the biggest possible value of |ρ(1)| for a MA(l) model is equal to cos(π/(l + 2)).
Details. Convergence of the series (7.5) (sketch)*. Since the radius of
convergence of the series (7.3) is greater than one, the series converges absolutely
at z = 1 and therefore
Σ_k |Dk| < ∞.

Denote by Sn,t the partial sum of the series (7.5), that is

Sn,t = Xt + D1 Xt−1 + D2 Xt−2 + · · · + Dn Xt−n


We are going to show that the sequence of random variables Sn,t converges in mean squares as n → ∞. Indeed,
Var(Sn,t − Sm,t) = Cov(Σ_{i=n+1}^{m} Di Xt−i, Σ_{j=n+1}^{m} Dj Xt−j)
= Σ_{i=n+1}^{m} Σ_{j=n+1}^{m} Di Dj Cov(Xt−i, Xt−j)
= Σ_{i=n+1}^{m} Σ_{j=n+1}^{m} Di Dj R(i − j)
≤ (Σ_{i=n+1}^{m} |Di|)(Σ_{j=n+1}^{m} |Dj|) σX²
= (Σ_{i=n+1}^{m} |Di|)² σX²

because |R(k)| ≤ σX² for all k. Since the series Σ |Dk| converges, the random variables Sn,t form a Cauchy sequence in mean squares and therefore converge in mean squares (see Appendix A, section 3).

Exercises
1. Which of the following MA models satisfy the invertibility condition?
Xt = εt − εt−1 + 2εt−2
Xt = εt + εt−1 − 0.5εt−2
Xt = εt − 0.6εt−1 + 0.8εt−2
Xt = εt + εt−1 + εt−2
2. For the following MA(2) model
Xt = εt + (2/15)εt−1 − (1/15)εt−2,
find the representation (7.5), that is find the formula for the coefficients.
3. Do the same for model
Xt = εt − εt−1 + 0.5εt−2 .
4. Do the same for model
Xt = εt + εt−1 + 0.25εt−2 .
Hint: You may find the following formula useful:
1/(1 − q)² = 1 + 2q + 3q² + · · · + kq^{k−1} + . . .

5*. Show that |ρ(1)| ≤ √2/2 for MA(2) models, and the models with ρ(1) = ±√2/2 are not invertible.
6**. Show that |ρ(1)| ≤ (1 + √5)/4 for MA(3) models, and the models with ρ(1) = ±(1 + √5)/4 are not invertible.

8. General ARMA Processes


We say that a stationary process Xt with zero expectation is a mixed autoregression—
moving average process, or ARMA(k, l) if it satisfies the equation
(8.1) Xt + a1 Xt−1 + · · · + ak Xt−k = b0 εt + · · · + bl εt−l
where εt is a white noise. Parameters k and l (the order of autoregression and the
order of moving average) define the structure of the model.
Denote
α(x) = 1 + a1 x + · · · + ak xk
β(x) = 1 + b1 x + · · · + bl xl
In terms of the shift operator, (8.1) can be rewritten as
(8.2) α(B)Xt = β(B)εt
Stationarity and Invertibility Conditions. We assume that all the roots
of the polynomials α(x) and β(x) satisfy the condition |z| > 1, that is, are located
outside the unit disk. In other words, α(x) should satisfy the stationarity condi-
tion, and β(x) should satisfy the invertibility condition. Without the stationarity
condition, the process can’t be represented as a general linear process. Without
the invertibility condition, the analysis is very hard. With those conditions,
1/α(z) = C0 + C1 z + C2 z² + . . .
1/β(z) = D0 + D1 z + D2 z² + . . .
and the equation (8.1) can be rewritten as
(8.3) Xt = β(B)(C0 + C1 B + C2 B 2 + . . . )εt
(general linear process) or as
(8.4) α(B)(D0 + D1 B + D2 B 2 + . . . )Xt = εt
(autoregression of infinite order). In particular, (8.3) implies the independency of
Xt and εt+k , k > 0.
Processes with non-zero mean. As in case of autoregression models, a
stationary process Yt with expectation µ is an ARMA(k, l) if Xt = Yt − µ is an
ARMA(k, l) with zero mean. In that case,
(8.5) Yt + a1 Yt−1 + · · · + ak Yt−k = c + b0 εt + · · · + bl εt−l
where c = (1 + a1 + · · · + ak )µ.
We’ll focus on processes with zero expectation.
Yule—Walker equations. From (8.1),
(8.6) Cov(Xt, Xt−m) + a1 Cov(Xt−1, Xt−m) + · · · + ak Cov(Xt−k, Xt−m) = b0 Cov(εt, Xt−m) + · · · + bl Cov(εt−l, Xt−m)
In particular, if m ≥ l + 1, then the expression on the right is equal to zero, and
therefore
ρ(m) + a1 ρ(m − 1) + · · · + ak ρ(m − k) = 0
So, for large m, ρ(m) is a linear combination of exponents and damped sine functions, as it was for AR(k) models. In order to find initial values of the autocovari-
ance/autocorrelation function, we have to evaluate the right side of (8.6), that is
to compute Cov(εt , Xs ) for t − l ≤ s ≤ t. One of the possible ways to do that is to
use (8.3).
ARMA(1,1). Let us consider a special case ARMA(1,1) (k = l = 1) in more
detail. A traditional way to write down the corresponding equation is as follows:
Xt − aXt−1 = εt + bεt−1
or
(1 − aB)Xt = (1 + bB)εt
The conditions of stationarity and invertibility imply |a| < 1, |b| < 1. We have
α(z) = 1 − az
and therefore
1/α(z) = 1 + az + a²z² + . . .
Hence, (8.3) becomes

Xt = (1 + bB)(1 + aB + a²B² + . . . ) εt = (1 + Σ_{k=1}^{∞} a^{k−1}(a + b) B^k) εt
With help of (2.9), (2.10) and (2.11), we get

σX² = σε² (1 + Σ_{k=1}^{∞} a^{2k−2}(a + b)²) = σε² (1 + (a + b)²/(1 − a²)),
R(l) = σε² (a^{l−1}(a + b) + Σ_{k=1}^{∞} a^{2k+l−2}(a + b)²) = σε² a^{l−1}(a + b)(1 + a(a + b)/(1 − a²)),
if l ≥ 1, and
(8.7) ρ(l) = a^{l−1}(a + b)(1 + ab)/(1 + 2ab + b²),  l ≥ 1.
In particular, ρ(l) decays at exponential rate for l ≥ 1. The same is true for the
partial autocorrelation function (though the computation is difficult).
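A quick way to see formula (8.7) at work is to compare it with the sample ACF of a simulated series. A rough sketch in Python/numpy (the function names are ours; the simulation is started at zero, which is harmless because the process is asymptotically stationary):

    import numpy as np

    def arma11_acf(a, b, nlags):
        rho = np.ones(nlags + 1)
        for l in range(1, nlags + 1):                 # formula (8.7)
            rho[l] = a**(l - 1) * (a + b) * (1 + a * b) / (1 + 2 * a * b + b**2)
        return rho

    def sample_acf(x, nlags):
        d = np.asarray(x, float) - np.mean(x)
        N = len(d)
        R = np.array([np.sum(d[:N - k] * d[k:]) / N for k in range(nlags + 1)])
        return R / R[0]

    rng = np.random.default_rng(0)
    eps = rng.standard_normal(2000)
    x = np.zeros(2000)
    for t in range(1, 2000):                          # X_t = 0.9 X_{t-1} + eps_t + 0.9 eps_{t-1}
        x[t] = 0.9 * x[t - 1] + eps[t] + 0.9 * eps[t - 1]
    print(np.round(arma11_acf(0.9, 0.9, 5), 3))
    print(np.round(sample_acf(x, 5), 3))              # close, up to sampling error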
Dangers of the ARMA models. Suppose Xt = εt is just a white noise.
Then, for any a,
(8.8) Xt − aXt−1 = εt − aεt−1
which looks formally as an ARMA(1,1) process. Generalizing, if the polynomials
α(x) and β(x) have a common root, the orders k and l can be reduced by 1 (we
factorize α and β in the equation (8.2) and cancel the common factors).
If we don’t notice that and try to fit ARMA(1,1) model to the data that are ac-
tually described by (8.8), the estimation may produce strange results (for instance,
all parameters are statistically insignificant) or may even fail.
If the roots do not coincide but are close to each other, then the order of the
model can’t be formally reduced. But, still, the estimation may go wrong. Also, in
order to find a statistically significant difference between a model with close roots
and a reduced model, we need very long data sets.
Advantages of the ARMA models. As we pointed out already, the ARMA
model can be rewritten as an autoregression or moving average model of infinite
Figure 24. Simulated ARMA(1,1) process Xt = 0.9Xt−1 + εt + 0.9εt−1.
Figure 25. Autocorrelation function for the ARMA(1,1) process Xt = 0.9Xt−1 + εt + 0.9εt−1.
Figure 26. Partial Autocorrelation function for the ARMA(1,1) process Xt = 0.9Xt−1 + εt + 0.9εt−1.

order (see (8.4) and (8.3)) and therefore can be approximated by an autoregression
or moving average model with sufficiently large order. However, ARMA model
typically has less parameters to estimate.
In fact, ARMA models are very flexible. As we will see later, they may approx-
imate every stationary process with continuous spectral density.
Examples. On Figures 24 - 26 you can see the graph of a simulated ARMA(1,1)
process Xt = 0.9Xt−1 + εt + 0.9εt−1 , together with its ACF and PACF. Figures 27
Figure 27. Simulated ARMA(1,1) process Xt = 0.9Xt−1 + εt − 0.85εt−1. Parameters a and b are such that the corresponding roots almost coincide. The series is very close to a white noise, as could be seen from its ACF and PACF below. To tell such a series from a white noise, we need, at least, a couple of thousands of observations.
Figure 28. Autocorrelation function for the ARMA(1,1) process Xt = 0.9Xt−1 + εt − 0.85εt−1.
Figure 29. Partial Autocorrelation function for the ARMA(1,1) process Xt = 0.9Xt−1 + εt − 0.85εt−1.

- 29 show the graph of the process Xt = 0.9Xt−1 + εt − 0.85εt−1 with close roots,
and its ACF and PACF. As we can see, the process is very close to the white noise.

Exercises
1. Which of the following models satisfy the stationarity and invertibility conditions?
Xt = Xt−1 + εt − εt−1 − 0.5εt−2
Xt = Xt−1 − 0.25Xt−2 + εt − εt−1 + 0.5εt−2
Xt = −Xt−1 − Xt−2 + εt − εt−1 + εt−2
2. For the ARMA(1,1) model
Xt = 0.5Xt−1 + εt + 0.25εt−1 ,
compute first three values of ACF (that is, ρ(1), ρ(2), ρ(3)) and the first three values
of PACF (that is, φ(1), φ(2), φ(3)). [Hint: Use (8.7).]
3*. For the ARMA(1,2) model
Xt − 0.9Xt−1 = 0.5 + εt + εt−1 + 0.25εt−2
find the formula for ACF. [Hint: Try to represent it as a general linear process and
use (2.11).]
4*. Do the same for ARMA(2,1) model
Xt − Xt−1 + 0.5Xt−2 = εt + 0.6εt−1 .

9. Non Stationary Processes and ARIMA (Box—Jenkins) Models


It might happen that the process Xt is not stationary, but its increments Yt =
∇Xt = Xt − Xt−1 (or, maybe, second increments) form a stationary process. With
luck, this new process could be an autoregression process, or maybe a moving
average, or maybe a mixed one. This way, we arrive at the following definition.
We say that the process Xt is a mixed autoregression and integrated moving
average, or ARIMA(k, d, l) process (here k, d, l are non-negative integers) if its dif-
ference of the order d is ARMA(k, l) process, possibly, with a non-zero expectation:
(9.1) Yt = ∇^d Xt,  α(B)(Yt − µ) = β(B)εt
Here α(z) and β(z) are polynomials of the degrees k and l, respectively, and εt
stands for the white noise. Also, we assume that α and β satisfy the stationarity
and invertibility conditions.
Whenever d > 0, the process is not stationary. In terms of the original process
Xt , the equation (9.1) could be rewritten as
(9.2) (1 − B)d α(B)Xt = c + β(B)εt
where, as usual, c = (1 + a1 + · · · + ak )µ = α(1)µ. Though the equation (9.2) looks
similar to the equation (8.5), the polynomial A(B) = (1 − B)d α(B) that stands on
the left, does not satisfy the stationarity condition: it has 1 as one of the roots,
with multiplicity d. So, in order to recognize (9.2) as an ARIMA model, we have to
factor out (1 − B)d which gives us the order of the difference, and make sure that
the remaining polynomials α(x) and β(x) satisfy the stationarity and invertibility
conditions.
Examples. 1. ARIMA(0,1,0). This is the simplest possible example of
ARIMA process. In this case, (9.1) can be rewritten as follows:
(9.3) ∇Xt = εt
or
(9.4) Xt = Xt−1 + εt
(we assume µ = 0). Iterating (9.4), we get
Xt = X0 + ε1 + ε2 + · · · + εt
(sums of i.i.d. random variables). The process is not stationary, however it does
not have a trend (that is, it can’t be decomposed into the sum of a noise and a
non-random function).
2. By contrast, suppose that
Xt = a + bt + εt
(linear trend plus white noise). Then
Yt = ∇Xt = b + εt − εt−1
which can be rewritten as
Yt − b = (1 − B)εt
which looks like (9.1) with one exception: β(z) = 1 − z does not satisfy the invert-
ibility condition (its only root is equal to 1).
This can be generalized as follows. Suppose that
Xt = a + bt + ηt
where ηt is an ARMA(k, l) process described by the equation
α(B)ηt = β(B)εt .
Then
Yt = ∇Xt = b + ∇ηt ,
and ζt = ∇ηt follows the equation
α(B)ζt = (1 − B)β(B)εt .
and therefore does not satisfy the invertibility condition because (1 − z)β(z) defi-
nitely has a root at 1. The same applies to any polynomial trends, say, parabolic
(of course, in order to eliminate a parabolic trend, you have to switch to second
increments, that is, apply the difference operator twice). So, an ARMA process
plus deterministic trend can’t be described by an ARIMA model. On Figures 32 and
33 you can see a simulated process of this type.
3. Consider now an equation
(9.5) Xt − 1.25Xt−1 + 0.25Xt−2 = εt
It can be rewritten as
A(B)Xt = εt
where
A(z) = 1 − 1.25z + 0.25z 2 = (1 − z)(1 − 0.25z)
The equation A(z) = 0 has roots z = 1 and z = 4 and therefore it does not satisfy
the stationarity condition. However, A(z) can be viewed as the product
A(z) = (1 − z)α(z)
where
α(z) = 1 − 0.25z
Figure 30. Simulated ARIMA(1,1,0) process described by the equation (9.5).
Figure 31. Simulated ARIMA(1,2,1) process described by the equation (1 + 0.95B)(1 − B)²Xt = (1 − 0.7B)εt.

satisfies the stationarity condition. Hence, the equation (9.5) can be rewritten as
follows
(1 − 0.25B)∇Xt = εt
or
(1 − 0.25B)Yt = εt
where Yt = ∇Xt is the first increment of Xt . The last equation describes an AR(1)
process with zero expectation, so the original equation (9.5) actually describes an
ARIMA(1,1,0) model. On Figure 30 you can see a simulated process that follows
the equation (9.5). Figure 31 shows an example of a simulated ARIMA(1,2,1)
process.
4. A slight modification of the previous example,
(9.6) Xt − 1.25Xt−1 + 0.25Xt−2 = εt + 2εt−1 ,
produces a model that does not satisfy the invertibility condition. It can still be
written as (1 − 0.25B)∇Xt = (1 + 2B)εt which looks like ARIMA(1,1,1) model, but
β(z) = 1 + 2z has the root z = 0.5.
5. Consider now an equation
Xt − 2Xt−1 + Xt−2 = εt
Figure 32. Simulated ARMA(1,0) process plus linear trend, as an example of a series that can't be represented as an ARIMA process.
Figure 33. Simulated ARMA(1,0) process plus parabolic trend as an example of a series that can't be represented as an ARIMA process. Compare it with Figure 31.

It can be rewritten as A(B)Xt = εt where A(z) = 1 − 2z + z 2 = (1 − z)2 . Therefore


the original equation can be rewritten as
∇²Xt = εt
or
Yt = εt
where Yt = ∇²Xt is the second difference of the process Xt. Therefore it is an ARIMA(0,2,0) process.
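Recognizing an ARIMA structure, as in the examples above, amounts to factoring out the unit roots of A(z). A small sketch in Python/numpy (the function name split_unit_roots is ours; in practice the computed roots are only numerically close to 1, hence the tolerance):

    import numpy as np

    def split_unit_roots(coeffs, tol=1e-8):
        # coeffs = [1, c1, ..., cp] for A(z) = 1 + c1 z + ... + cp z^p;
        # returns d = multiplicity of the root z = 1, and the remaining roots
        # (which must lie outside the unit circle for alpha(z) to be stationary)
        roots = np.roots(coeffs[::-1])        # np.roots wants highest power first
        unit = np.abs(roots - 1.0) < tol
        return int(unit.sum()), roots[~unit]

    # Example 3: A(z) = 1 - 1.25 z + 0.25 z^2 = (1 - z)(1 - 0.25 z)
    d, other = split_unit_roots([1.0, -1.25, 0.25])
    print(d, other)                           # 1 [4.]  -> ARIMA(1,1,0), alpha(z) = 1 - 0.25 z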

Exercises
1. Does the following equation describe an ARMA process? If yes, what are
k, l? Does it describe an ARIMA process? If yes, what are the structure parameters
(that is k, d, l?)
Xt − 0.5Xt−1 − 0.5Xt−2 = εt + 0.7εt−1
2. Same questions about the equation
Xt − 0.5Xt−1 − 0.5Xt−2 = 3 + εt + 0.7εt−1
3. Same for the equation
Xt = 2Xt−1 − Xt−2 + εt + εt−1 + 0.5εt−2
4*. Same for the equation


Xt − 2Xt−1 + 1.25Xt−2 − 0.25Xt−3 = εt

10. Prediction in a Stationary Model


Suppose Xt is a stochastic process. We’d like to construct an m-step ahead
forecast, that is a prediction for XN +m based on the data X1 , X2 , . . . , XN . More
precisely, we’d like to choose a predictor — a function
X̂N +m (N ) = X̂N +m (N )(X1 , . . . , XN )
which is a reasonable approximation for XN +m . In order to be specific, we would
like to minimize the expectation
E(X̂N +m (N ) − XN +m )2
of the square of the prediction error.
It turns out that this problem has a simple solution, provided we know the
distribution of the process. Namely, the best m steps ahead predictor equals to the
conditional expectation of XN +m given all the past values of the process, that is
(10.1) E(XN +m |XN , XN −1 , . . . , X1 ).
In general, the computation of the conditional expectation requires information
about the joint distribution of the values of the process, which is not possible unless
we know the distribution of the noise. However, conditional expectation has certain
properties (see Appendix A). In particular,
1. If W is independent from Y1 , . . . , Yn , then
E(W |Y1 , . . . , Yn ) = EW
2. If W = f (Y1 , . . . , Yn ) is a function of Y1 , . . . , Yn , then
E(f (Y1 , . . . , Yn )|Y1 , . . . , Yn ) = f (Y1 , . . . , Yn ).
With those in mind, let us consider some
Examples. 1. AR(1). Suppose the process Xt is an AR(1) process with zero
mean
Xt = aXt−1 + εt
where the white noise εt is independent from the past values of the process Xt−1 ,
Xt−2 , . . . . We’d like to construct an m steps ahead forecast X̂N +m (N ), that is a
prediction for XN +m given all the past Xs , s ≤ N . Since
XN +1 = aXN + εN +1 ,
we have
X̂N +1 (N ) = E(XN +1 |XN , XN −1 , . . . , X1 ) = aE(XN |XN , XN −1 , . . . , X1 )
+ E(εN +1 |XN , XN −1 , . . . , X1 ).
However,
E(XN |XN , XN −1 , . . . , X1 ) = XN
because XN is a function of XN , and
E(εN +1 |XN , XN −1 , . . . , X1 ) = EεN +1 = 0
because εN +1 is independent from X1 , . . . , XN . Therefore
X̂N +1 (n) = aXN .
In order to find an m step ahead prediction X̂N +m (N ), note that


(10.2) XN +m = εN +m + aεN +m−1 + · · · + am−1 εN +1 + am XN
Therefore
X̂N +m (N ) = E(XN +m |XN , XN −1 , . . . , X1 )
= E(εN +m |XN , XN −1 , . . . , X1 ) + aE(εN +m−1 |XN , XN −1 , . . . , X1 ) + . . .
+ am−1 E(εN +1 |XN , XN −1 , . . . , X1 ) + am E(XN |XN , XN −1 , . . . , X1 )
Since the noise εN +k , k > 0 is independent from Xs , s ≤ N , we have
E(εN +k |XN , XN −1 , . . . , X1 ) = 0,
and
E(XN |XN , XN −1 , . . . , X1 ) = XN
as above. Therefore
X̂N +m (N ) = am XN .
Remark. Note that
εN +1 = XN +1 − aXN = XN +1 − X̂N +1 (N ).
So, the noise εt is, in fact, the one step ahead prediction error at time t. This
property actually holds for all ARMA(k, l) processes provided we do have the in-
vertibility condition.
2. AR(k). In a similar way, we can handle an autoregression of the order k.
Indeed, let
Xt + a1 Xt−1 + · · · + ak Xt−k = εt
Replacing t by N + 1, we get
XN +1 + a1 XN + · · · + ak XN −k+1 = εN +1
As above, we can take the conditional expectation of both sides given X1 , . . . , XN
and get
X̂N +1 (N ) + a1 XN + · · · + ak XN −k+1 = 0,
so that
X̂N +1 (N ) = −a1 XN − · · · − ak XN −k+1
(no surprise). Further,
XN +2 + a1 XN +1 + a2 XN + · · · + ak XN −k+2 = εN +2
Once again, taking the conditional expectation of both sides, we get
X̂N +2 (N ) + a1 X̂N +1 (N ) + a2 XN + · · · + ak XN −k+2 = 0
that is
X̂N +2 (N ) = −a1 X̂N +1 (N ) − a2 XN · · · − ak XN −k+2
and so on.
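The recursive scheme just described is easy to code. A minimal sketch in Python (the function name ar_forecast is ours): predictions replace the unknown future values as we step forward.

    import numpy as np

    def ar_forecast(a, x_last, m):
        # m-step ahead predictions for X_t + a[0] X_{t-1} + ... + a[k-1] X_{t-k} = eps_t,
        # given the k most recent observations x_last = [X_{N-k+1}, ..., X_N] (oldest first)
        hist = list(x_last)
        preds = []
        for _ in range(m):
            xhat = -sum(aj * hist[-(j + 1)] for j, aj in enumerate(a))
            preds.append(xhat)
            hist.append(xhat)
        return np.array(preds)

    # AR(1) check: X_t = 0.8 X_{t-1} + eps_t, i.e. a = [-0.8], X_N = 2
    print(ar_forecast([-0.8], [2.0], 3))      # [1.6, 1.28, 1.024] = 0.8^m * X_N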
3. MA(1). Let Xt = εt + bεt−1 be a MA(1) process with zero expectation.
We also assume that Xt is invertible (in case of MA(1), it means that |b| < 1).
Let’s begin with an easier part and find a two steps ahead prediction. We have
XN +2 = εN +2 + bεN +1
and therefore
X̂N +2 (N ) = E(εN +2 |XN , XN −1 , . . . , X1 ) + bE(εN +1 |XN , XN −1 , . . . , X1 ) = 0
because, once again, the future values of the noise are independent from the past
values of the process. For the same reason,

X̂N +m (N ) = 0 whenever m ≥ 2

Finding X̂N+1(N) is a different story. To simplify things, assume that we actually know an infinite number of past values of the process. We note that
εN+1 = XN+1 − bεN
= XN+1 − bXN + b²εN−1
= XN+1 − bXN + b²XN−1 − b³εN−2
. . .
= XN+1 − Σ_{k=0}^{∞} (−1)^k b^{k+1} XN−k

We can now take conditional expectations of both sides given XN , XN −1 , . . . . Since


εN +1 is independent from the past values of the process, we get

0 = X̂N+1(N) − Σ_{k=0}^{∞} (−1)^k b^{k+1} XN−k.

or

X
X̂N +1 (N ) = b (−1)k bk XN −k
k=0

4. MA(l). In a similar way, we can treat moving average of order l. As above,


we have
X̂N +m (N ) = 0 whenever m ≥ l + 1
(we assume that the expectation of the process is equal to zero). One, two, etc.,
l steps ahead predictions could be found if we use the invertibility condition and
represent Xt as an autoregression of infinite order. They are equal to weighted
sums of past values of the process.
5. ARMA(k, l). In order to handle a general ARMA(k, l) model, we should
use the invertibility condition and rewrite it as an autoregression of the infinite
order. For instance, consider an ARMA(1,1) process described by the equation

(10.3) Xt − aXt−1 = εt + bεt−1

Once again, assume that we actually know infinitely many past values of the process.
Similar to MA(1) case, we have
εN = XN − aXN −1 − bεN −1
= XN − aXN −1 − b(XN −1 − aXN −2 − bεN −2 )
= XN − (a + b)XN −1 + abXN −2 + b2 εN −2
...

= XN + Σ_{k=1}^{∞} (−1)^k (a + b) b^{k−1} XN−k
and therefore
(10.4) XN+1 = εN+1 + aXN + bεN
= εN+1 + aXN + bXN + b Σ_{k=1}^{∞} (−1)^k (a + b) b^{k−1} XN−k
= εN+1 + Σ_{k=0}^{∞} (−1)^k (a + b) b^k XN−k
Taking a conditional expectation given XN , XN −1 , . . . , we get a formula

X̂N+1(N) = Σ_{k=0}^{∞} (−1)^k (a + b) b^k XN−k

Replacing N by N + 1 in (10.4), and again, taking a conditional expectation given


XN , XN −1 , . . . , we can obtain X̂N +2 (N ) and so on.
Note that, this is exactly what we did in MA(1) case.
6. ARIMA processes. Finally, let’s consider ARIMA processes. Assume
first that the order of moving average is equal to zero. For instance, let’s consider
an ARIMA(1,1,0) process described by the equation
(10.5) Yt = ∇Xt,  (1 − aB)(Yt − µ) = εt
To simplify things even further, let us assume that µ = 0. Rewriting the equation
in terms of Xt , we get
(1 − B)(1 − aB)Xt = εt
or, dropping the back shift operators,
Xt − (1 + a)Xt−1 + aXt−2 = εt
which looks like an autoregression of the second order though it does not satisfy
the stationarity condition. Still,
XN +1 = (1 + a)XN − aXN −1 + εN +1
and we can take conditional expectations of both sides given XN , XN −1 , . . . . This
way, we get
X̂N +1 (N ) = (1 + a)XN − aXN −1
X̂N +2 (N ) = (1 + a)X̂N +1 (N ) − aXN
and so on.
Let us now consider an ARIMA(0,1,1) process described by the equation
(10.6) Yt = ∇Xt,  Yt − m = (1 + bB)εt
or, assuming m = 0, by the equation
Yt = Xt − Xt−1 = εt + bεt−1
This looks exactly like (10.3), with a = 1. However, Xt is not stationary and the
series (10.4) may diverge. In order to handle the situation, we have to switch to
the series Yt . Since Yt is MA(1), we have
ŶN +k (N ) = 0
for all k ≥ 2 and



ŶN+1(N) = b Σ_{k=0}^{∞} (−1)^k b^k YN−k
But then,
ŶN +1 (N ) = X̂N +1 (N ) − XN
and therefore

X̂N+1(N) = XN + b Σ_{k=0}^{∞} (−1)^k b^k (XN−k − XN−k−1)
Beyond that,
ŶN +2 (N ) = X̂N +2 (N ) − X̂N +1 (N ) = 0
and therefore
X̂N +2 (N ) = X̂N +1 (N )
and so on.
Details. Justification of (10.1). Let X1 , . . . , Xn and Y be random variables.
Suppose we want to construct a prediction for Y given X1 , . . . , Xn . We claim that
the best predictor is given by the formula
Ŷ = E(Y |X1 , . . . , Xn ).
First of all, the conditional expectation Ŷ is a function of the desired X1 , . . . , Xn .
Next, let Ỹ be any other predictor, that is, another function of X1 , . . . , Xn . Then
E(Y − Ỹ )2 = E[(Y − Ŷ ) + (Ŷ − Ỹ )]2
= E(Y − Ŷ )2 + 2E(Y − Ŷ )(Ŷ − Ỹ ) + E(Ŷ − Ỹ )2
In order to evaluate the middle term, we note that
E(Y − Ŷ )(Ŷ − Ỹ ) = E{E(Y − Ŷ )(Ŷ − Ỹ )|X1 , . . . , Xn )}
and since both Ŷ and Ỹ are functions of X1 , . . . , Xn , the conditional expectation
E((Y − Ŷ )(Ŷ − Ỹ )|X1 , . . . , Xn ) = (Ŷ − Ỹ )E(Y − Ŷ |X1 , . . . , Xn )
= (Ŷ − Ỹ )[E(Y |X1 , . . . , Xn ) − Ŷ ] = 0
So,
E(Y − Ỹ )2 = E(Y − Ŷ )2 + E(Ŷ − Ỹ )2 ≥ E(Y − Ŷ )2
since the second term is non-negative.

Exercises
1. A process Xt satisfies the equation
Xt − 0.8Xt−1 = εt
Give a formula for an m steps ahead prediction, m ≥ 1.
2. Same question about the process
Xt = εt − 0.8εt−1
3. A process Xt is an ARIMA(0,2,0) process, that is, Xt satisfies the equation
∇2 Xt = (1 − B)2 Xt = εt
Give a formula for one step ahead prediction.
CHAPTER 3

Stationary Models: Estimation

1. Time Series and Stochastic Processes. Estimation of Mean and Variance of a Time Series
As long as we talk about models and their properties, we are in the domain
of probability. When we switch to estimation of parameters, we move to statistics.
And, in statistics, our starting point is not a model but a data set. Through most
of this chapter, we will deal with stationary series, that is a time series that could
be viewed as a realization of a stationary process. We begin with estimation of the
basic characteristics of a stationary series.
Estimation of the mean.
Let X1 , . . . , XN be a data set. Assume that Xt is a stationary series with
unknown expectation µ. The most natural way to estimate µ is to use the sample
mean
X̄ = (X1 + · · · + XN)/N
Clearly,
E X̄ = (1/N) Σ_{i=1}^{N} EXi = µ

so X̄ is unbiased. Is it consistent? The following example demonstrates that some


assumptions may be required for consistency.
Example. Suppose Xt ≡ Y for all t, where Y is a random variable with expectation µ. Then X̄ ≡ Y for every N, so X̄ → Y rather than to the constant µ, and X̄ is not consistent. However, in this particular case, ρ(k) ≡ 1 for all k.

In fact, one can show that X̄ is consistent if the series
(1.1) Σ_{k=1}^{∞} |ρ(k)|

converges. To verify that, let us compute the variance of X̄. We have

(1.2) Var X̄ = (1/N²) Σ_{k=1}^{N} Σ_{l=1}^{N} Cov(Xk, Xl) = (1/N²) Σ_{k=1}^{N} Σ_{l=1}^{N} R(k − l) = (σX²/N²) Σ_{k=1}^{N} Σ_{l=1}^{N} ρ(k − l)
= (σX²/N²) (N + 2(N − 1)ρ(1) + 2(N − 2)ρ(2) + · · · + 2ρ(N − 1))

and therefore
2
σX
Var X̄ ≤
(1 + 2|ρ(1)| + 2|ρ(2)| + · · · + 2|ρ(N − 1)|)
N

σ2 X
≤ X (1 + 2 |ρ(k)|) → 0
N
k=1
P∞
as N → ∞ if the sum k=1 |ρ(k)| is finite.
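Formula (1.2) is easy to check by simulation. A rough sketch in Python/numpy for an AR(1) series (all names are ours; the series is started at zero, so the agreement is only approximate):

    import numpy as np

    a, N, nrep = 0.7, 200, 2000
    rng = np.random.default_rng(1)
    means = np.empty(nrep)
    for r in range(nrep):
        x = np.zeros(N)
        eps = rng.standard_normal(N)
        for t in range(1, N):
            x[t] = a * x[t - 1] + eps[t]
        means[r] = x.mean()

    sigma2_x = 1 / (1 - a**2)                 # variance of the stationary AR(1)
    k = np.arange(1, N)
    var_formula = sigma2_x / N**2 * (N + 2 * np.sum((N - k) * a**k))   # formula (1.2)
    print(means.var(), var_formula)           # the two numbers should be close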
Estimation of the variance. It is reasonable to begin with
σ̂² = (1/N) Σ_{i=1}^{N} (Xi − X̄)²
We may expect this estimate to be biased. Indeed, it is biased even if we work
with a sample, that is, with a collection of i.i.d. random variables. In case of a
sample, we are able to improve the situation by considering another estimate s2
given by (B.1.2). In case of a time series, consecutive values of the series may have
non-trivial correlations, and that trick is no longer working.
Indeed, similar to (B.1.1), we can show that
(1.3) σ̂² = (1/N) Σ_{i=1}^{N} (Xi − µ)² − (X̄ − µ)²
which implies
E σ̂² = σX² − Var X̄.
In case of a sample, the bias Var X̄ = σX²/N is proportional to σX². In case of a
time series, the bias Var X̄ is given by (1.2), and an unbiased estimate can’t be
constructed unless we know the autocorrelation function in advance, which never
happens. For these reasons, whenever we estimate second moments, like variance,
autocovariance and autocorrelation function, we can’t get rid of bias. We should
be happy if we get asymptotically unbiased and consistent estimates.
So, can we claim that σ̂X² is asymptotically unbiased? Let us assume that the series (1.1) converges. This is harmless enough, in particular because otherwise we can't even estimate the mean. Under this assumption, Var X̄ → 0 as N → ∞ by (1.2). But Var X̄ is precisely the bias of the estimate σ̂X². Hence σ̂X² is asymptotically unbiased under the above assumption. Is it consistent? We will discuss that in the next section.
Details. Verification of (1.3). Easy! We write
Σ_{i=1}^{N} (Xi − µ)² = Σ_{i=1}^{N} ((Xi − X̄) + (X̄ − µ))²
= Σ_{i=1}^{N} (Xi − X̄)² + 2(X̄ − µ) Σ_{i=1}^{N} (Xi − X̄) + N(X̄ − µ)²
and note that Σ_{i=1}^{N} (Xi − X̄) = 0.

Exercises
1. Let Xt be a MA(1) process Xt = m + εt + aεt−1 where εt is the white noise.
(a) Compute mean and variance of Xt in terms of the coefficients of the model and
the variance of the noise σε². (b) Compute the variance of the sample mean X̄ in terms of a and σε². Find the quotient Var(X̄)/(σX²/N) and graph it as a function of a, −1 < a < 1. Which one is bigger, Var(X̄) or σX²/N?
2. Let Xt be an AR(1) process Xt = c + aXt−1 + εt where εt is the white noise.
(a) Compute mean and variance of Xt in terms of the coefficients of the model and
the variance of the noise σε2 . (b) Compute the variance of the sample mean X̄ in
terms of a and σε². Find the quotient Var(X̄)/(σX²/N) and graph it as a function of a, −1 < a < 1. Which one is bigger, Var(X̄) or σX²/N?

2. Estimation of ACV, ACF and PACF


Let X1 , . . . , XN be a stationary series. Out of various estimates for its autoco-
variance function R(k), the following one is normally used:
(2.1) R̂(k) = R̂(−k) = (1/N) Σ_{i=1}^{N−k} (Xi − X̄)(Xi+k − X̄),  k = 0, 1, . . . , N − 1
It is called the sample autocovariance function, or sample ACV. This estimate is
definitely biased, in particular because the sum contains only N − k terms and we
divide by N . However, the resulting function R̂(k) is positively semidefinite (this
is important for spectral analysis, and that is the primary reason why the estimate
(2.1) was chosen).
In order to estimate the autocorrelation function ρ(k) and the partial auto-
correlation function φ(k), we use formulas (II.1.7) and (II.1.10) with the sample
autocovariance R̂(k) instead of R(k), that is
(2.2) ρ̂(k) = R̂(k)/R̂(0)
and
(2.3) φ̂(k) =
| 1         ρ̂(1)      . . .  ρ̂(k−2)   ρ̂(1)   |
| ρ̂(1)     1          . . .  ρ̂(k−3)   ρ̂(2)   |
| . . .     . . .      . . .  . . .     . . .   |
| ρ̂(k−2)   ρ̂(k−3)    . . .  1         ρ̂(k−1) |
| ρ̂(k−1)   ρ̂(k−2)    . . .  ρ̂(1)     ρ̂(k)   |
divided by
| 1         ρ̂(1)      . . .  ρ̂(k−2)   ρ̂(k−1) |
| ρ̂(1)     1          . . .  ρ̂(k−3)   ρ̂(k−2) |
| . . .     . . .      . . .  . . .     . . .   |
| ρ̂(k−2)   ρ̂(k−3)    . . .  1         ρ̂(1)   |
| ρ̂(k−1)   ρ̂(k−2)    . . .  ρ̂(1)     1       |
Corresponding functions are called sample autocorrelation and sample partial au-
tocorrelation, or sample ACF and sample PACF, respectively.
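In practice these estimates are computed by software, but a bare-bones version is short. A sketch in Python/numpy (the function names are ours): the sample ACF follows (2.1)-(2.2); instead of evaluating the determinants in (2.3) directly, φ̂(k) is obtained as the last coefficient of the order-k Yule—Walker system, which gives the same value by Cramer's rule.

    import numpy as np

    def sample_acf(x, nlags):
        d = np.asarray(x, float) - np.mean(x)
        N = len(d)
        R = np.array([np.sum(d[:N - k] * d[k:]) / N for k in range(nlags + 1)])  # (2.1)
        return R / R[0]                                                          # (2.2)

    def sample_pacf(x, nlags):
        rho = sample_acf(x, nlags)
        phi = np.ones(nlags + 1)
        for k in range(1, nlags + 1):
            Rk = np.array([[rho[abs(i - j)] for j in range(k)] for i in range(k)])
            phi[k] = np.linalg.solve(Rk, rho[1:k + 1])[-1]
        return phi

    rng = np.random.default_rng(2)
    eps = rng.standard_normal(500)
    x = np.zeros(500)
    for t in range(1, 500):
        x[t] = 0.9 * x[t - 1] + eps[t]        # AR(1): sample PACF should be small beyond lag 1
    print(np.round(sample_acf(x, 5), 2))
    print(np.round(sample_pacf(x, 5), 2))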
Our distant goal is to be able to test statistical hypotheses like ρ(k) = 0 for all
k > k0 or φ(l) = 0 for all l > l0 . To this end, we need more information about the
distribution of the sample ACV, sample ACF and sample PACF. In particular, we
need to know (or be able to estimate) the bias, the variance and the covariance of
the estimates.
We have already seen that the estimate for the variance, and therefore the
estimate for R(0), is asymptotically unbiased. Similar computations (see below)
allow us to show that
(2.4) E R̂(k) ≈ R(k) − (k/N) R(k) − ((N − k)/N) Var X̄
= R(k) − (k/N) R(k) − ((N − k)/N)(σX²/N)(1 + 2((N − 1)/N)ρ(1) + 2((N − 2)/N)ρ(2) + · · · + (2/N)ρ(N − 1))
The last two terms represent the bias. It does not exceed
(k/N) R(k) + ((N − k)/N²) σX² (1 + 2 Σ_{k=1}^{∞} |ρ(k)|)
If the series Σ_{k=1}^{∞} |ρ(k)| converges (and, without that, even the estimation of the mean is a problem), then the bias goes to zero as N → ∞ and our estimate is asymptotically unbiased for every particular k.
In order to show that R̂(k) is a consistent estimate for R(k), we need to be able
to estimate the variance of R̂(k). However, in order to be able to study properties of
sample ACF and sample PACF, we need also things like Cov(R̂(k), R̂(k + l)). This
part is much harder and requires an additional assumption. Namely, we assume
that the process Xt is Gaussian. This assumption helps us to reduce expressions of
the type EXk Xl Xn Xm to the covariance function (see details below). As a result,
we can show that
(2.5) Cov(R̂(k), R̂(k + l)) ≈ (1/N) Σ_{t=−∞}^{∞} R(t)R(t + l) + (1/N) Σ_{t=−∞}^{∞} R(t − k)R(t + k + l)
Setting l = 0, we get
(2.6) Var(R̂(k)) ≈ (1/N) Σ_{t=−∞}^{∞} R(t)² + (1/N) Σ_{t=−∞}^{∞} R(t − k)R(t + k)
In particular,

(2.7) Var(R̂(0)) = Var(σ̂X²) ≈ (2/N) Σ_{t=−∞}^{∞} R(t)²
Formulae (2.7) and (2.6) imply, in particular, that σ̂X², as well as the sample ACV R̂(k), are consistent estimates for the variance and for the autocovariance function.
Details. 1. Verification of (2.4) (sketch). To simplify computations, let us assume that
(1/(N − k)) Σ_{t=1}^{N−k} Xt ≈ (1/(N − k)) Σ_{t=k+1}^{N} Xt ≈ X̄
(the average of any N − k consecutive terms of the series is approximately equal to the average of all N terms). This is definitely true if k is much smaller than N. Then
(1/(N − k)) Σ_{t=1}^{N−k} (Xt − X̄)(Xt+k − X̄) ≈ (1/(N − k)) Σ_{t=1}^{N−k} (Xt − µ)(Xt+k − µ) − (X̄ − µ)²
where µ = EXt is the expectation of the series. However, the expectation of the right side is exactly equal to
R(k) − Var(X̄).
On the other hand, the expression on the left is equal to (N/(N − k)) R̂(k) (compare with (2.1)). Substituting Var(X̄) from (1.2), we get (2.4).
2. Verification of (2.5) (sketch). In order to demonstrate how it could
be done, we consider a very special case. Namely, suppose that the expectation
µ = EXt is known to us in advance, and it is equal to zero. In this, rather unlikely,
situation, we can replace X̄ by zero in the formula for the sample autocovariance
function. So, the formula (2.1) becomes
(2.8) R̂(k) = R̂(−k) = (1/N) Σ_{i=1}^{N−k} Xi Xi+k,  k = 0, 1, . . . , N − 1

and (2.4) becomes
(2.9) E R̂(k) = ((N − k)/N) EXi Xi+k = ((N − k)/N) R(k)
From (2.8) and (2.9),
(2.10) Cov(R̂(k), R̂(k + l)) = E[R̂(k)R̂(k + l)] − E R̂(k) E R̂(k + l)
= E[R̂(k)R̂(k + l)] − ((N − k)/N)((N − k − l)/N) R(k)R(k + l)
= (1/N²) Σ_{s=1}^{N−k} Σ_{t=1}^{N−k−l} E(Xs Xs+k Xt Xt+k+l) − ((N − k)/N)((N − k − l)/N) R(k)R(k + l)

Assuming that the series is Gaussian, we can use the formula (A.2.7). We get
E(Xs Xs+k Xt Xt+k+l ) = R(t − s)R(t − s + l)
+ R(t − s + k + l)R(t − s − k) + R(k)R(k + l)
and (2.10) becomes
(2.11) Cov(R̂(k), R̂(k + l)) = (1/N²) Σ_{s=1}^{N−k} Σ_{t=1}^{N−k−l} R(t − s)R(t − s + l)
+ (1/N²) Σ_{s=1}^{N−k} Σ_{t=1}^{N−k−l} R(t − s + k + l)R(t − s − k)
+ (1/N²) Σ_{s=1}^{N−k} Σ_{t=1}^{N−k−l} R(k)R(k + l)
− ((N − k)/N)((N − k − l)/N) R(k)R(k + l)
The last two terms cancel each other, and the first two can be simplified if we collect
terms with the same value of t − s.
Indeed, let i = t − s. We have −N + k + 1 ≤ i ≤ N − k − l − 1. Assume,
for instance, that l ≥ 0 and denote by bi the number of terms that correspond to
Figure 1. Autocorrelation function for the sunspots data shown on Figure 4 in the Introduction.
Figure 2. Partial Autocorrelation function for the sunspots data shown on Figure 4 in the Introduction.

i = t − s. Then the equation (2.11) can be rewritten as


(2.12) Cov(R̂(k), R̂(k + l)) = (1/N) Σ_{i=−N+k+1}^{N−k−l−1} (bi/N) R(i)R(i + l) + (1/N) Σ_{i=−N+k+1}^{N−k−l−1} (bi/N) R(i + k + l)R(i − k)
One can show that bi = min{N − k + i, N − k − l, N − k − l − i}. In particular, bi/N ≤ 1. Assuming that N is much larger than k and l, and that the series Σ_{k=1}^{∞} |R(k)| converges (which follows immediately from the convergence of Σ_{k=1}^{∞} |ρ(k)|), we can replace the coefficients bi/N by 1 and extend the summation in (2.12) up to infinities. This way we get (2.5). Formulae (2.6) and (2.7) follow from (2.5) as special cases.
Surprisingly, (2.5), (2.6) and (2.7) remain valid if the expectation is not known
though the process still has to be Gaussian. The justification, however, is much
more tedious.
Examples. On the following graphs, you can see sample ACF and sample
PACF for several time series discussed earlier. Some of those series are simulated
ones, with ACF and PACF known (shown in Chapter 2).

Exercises
Figure 3. Sample Autocorrelation function for the simulated AR(1) process Xt = 0.9Xt−1 + εt (see Figure II.11). Compare it with the theoretical ACF shown in narrow bars (see also Figure II.12).
Figure 4. Sample Partial Autocorrelation function for the simulated AR(1) process Xt = 0.9Xt−1 + εt (see Figure II.11). Compare it with the theoretical PACF shown in narrow bars (see also Figure II.13).
Figure 5. Sample Autocorrelation function for the simulated AR(1) process Xt = −0.9Xt−1 + εt (see Figure II.14). Compare it with the theoretical ACF shown in narrow bars (see also Figure II.15).
1. For a data set (will be posted on the web), estimate mean, variance, first
ten values of ACF (that is, ρ(1), . . . , ρ(10)) and first three values of PACF (that is,
φ(1), φ(2), φ(3)).
Figure 6. Sample Partial Autocorrelation function for the simulated AR(1) process Xt = −0.9Xt−1 + εt (see Figure II.14). Compare it with the theoretical PACF shown in narrow bars (see also Figure II.16).
Figure 7. Sample Autocorrelation function for the simulated AR(2) process Xt = 1.8Xt−1 − 0.9Xt−2 + εt (see Figure II.21). Compare it with the theoretical ACF shown in narrow bars (see also Figure II.22).
Figure 8. Sample Partial Autocorrelation function for the simulated AR(2) process Xt = 1.8Xt−1 − 0.9Xt−2 + εt (see Figure II.21). Compare it with the theoretical PACF shown in narrow bars (see also Figure II.23).
Figure 9. Sample Autocorrelation function for the simulated ARMA(1,1) process Xt = 0.9Xt−1 + εt + 0.9εt−1 (see Figure II.24). Compare it with the theoretical ACF shown in narrow bars (see also Figure II.25).
Figure 10. Sample Partial Autocorrelation function for the simulated ARMA(1,1) process Xt = 0.9Xt−1 + εt + 0.9εt−1 (see Figure II.24). Compare it with the theoretical PACF shown in narrow bars (see also Figure II.26).

3. Properties of the Sample ACF and Sample PACF


Sample ACF. Sample autocorrelation function ρ̂(k) has been defined by the
formula (2.2) as a quotient of two estimates R̂(k) and R̂(0). Hence, it inherits some
properties of R̂(k), in particular positive semidefiniteness. Also, ρ̂(0) = 1 and it
could be shown that |ρ̂(k)| ≤ 1 for all k.
Expectation and variance of ρ̂(k) are not easy to find. It could be shown
however, that, for large N ,
(3.1) E ρ̂(k) ≈ ((N − k)/N) ρ(k).
If, in addition, the process is Gaussian, then

(3.2) Var ρ̂(k) ≈ (1/N) Σ_{t=−∞}^{∞} [ρ²(t) + ρ(t + k)ρ(t − k) + 2ρ²(k)ρ²(t) − 4ρ(k)ρ(t)ρ(t − k)]
Figure 11. Sample Autocorrelation function for the simulated ARMA(1,1) process Xt = 0.9Xt−1 + εt − 0.85εt−1 (see Figure II.27). Compare it with the theoretical ACF shown in narrow bars (see also Figure II.28). As we can see, the series is indeed indistinguishable from the white noise, as promised. Here the sample size T = 400.
Figure 12. Sample Partial Autocorrelation function for the simulated ARMA(1,1) process Xt = 0.9Xt−1 + εt − 0.85εt−1 (see Figure II.27). Compare it with the theoretical PACF shown in narrow bars (see also Figure II.29).

and

(3.3) Cov(ρ̂(k), ρ̂(l)) ≈ (1/N) Σ_{t=−∞}^{∞} [ρ(t)ρ(t + l − k) + ρ(t + l)ρ(t − k) + 2ρ(k)ρ(l)ρ²(t) − 2ρ(k)ρ(t)ρ(t − l) − 2ρ(l)ρ(t)ρ(t − k)]
In particular, the values ρ̂(k) may be highly correlated for different k. Also, it could
be shown that, for large N , the distribution of ρ̂(k) is approximately normal.
Justification of (3.1), (3.2) and (3.3) is genuinely hard. Some arguments, designed to illustrate the ideas and to outline the assumptions, can be found at the end of the section.
Examples. 1. White noise. If Xt is a white noise, then ρ(k) = 0 for all k ≠ 0 and therefore

E ρ̂(k) ≈ 0,    Var ρ̂(k) ≈ 1/N,    Cov(ρ̂(k), ρ̂(l)) ≈ 0

whenever k, l > 0, k ≠ l.
Figure 13. Sample ACF can be computed for any series, not necessarily a stationary one. Here you can see a sample ACF for the ibm data (see Figure I.33).

Figure 14. Sample ACF of the service data (see Figure 2 in the Introduction) as an example of sample ACF for the data with trend and seasonality.

2. MA(1) process. Suppose the process Xt satisfies (II.2.1). In this particular case, ρ(0) = 1, ρ(±1) = b/(1 + b²) and ρ(k) = 0 for all other k (see (II.2.3)). Then, for k > 1,

E ρ̂(k) ≈ 0,    Var ρ̂(k) ≈ (1/N)(1 + 2ρ²(1))

and

Cov(ρ̂(k), ρ̂(m)) ≈ 0

if k > 1, m > k + 2.
3. MA(l) process. Similar to the previous example,

E ρ̂(k) ≈ 0,    Var ρ̂(k) ≈ (1/N)(1 + 2ρ²(1) + · · · + 2ρ²(l))

whenever k > l, and

Cov(ρ̂(k), ρ̂(m)) ≈ 0

if k > l, m > k + 2l.
4. AR(1) process. Suppose Xt satisfies (II.3.1). In this particular case, ρ(k) = a^{|k|} and the sum of the series (3.2) can be evaluated. Assuming that k is large and the parameter a is not too close to 1 or −1 (so that a^k can be ignored), we have

Var ρ̂(k) ≈ (1/N) · (1 + a²)/(1 − a²)
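These approximations are easy to try numerically. The following minimal Python sketch (ours, not part of the text; the helper names sample_acf and ma_acf_bounds are invented for illustration) computes the sample ACF of a simulated AR(1) series and the MA(l)-type variance bound from Example 3; for l = 0 the bound reduces to the white-noise bound 2/√N.

import numpy as np

def sample_acf(x, max_lag):
    # rho_hat(k) = R_hat(k) / R_hat(0), with R_hat(k) the sample autocovariance
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    denom = np.sum(xc * xc)
    return np.array([np.sum(xc[:n - k] * xc[k:]) / denom for k in range(max_lag + 1)])

def ma_acf_bounds(rho_hat, l, n):
    # approximate 95% bound for rho_hat(k), k > l, under an MA(l) hypothesis:
    # Var rho_hat(k) ~ (1 + 2 rho^2(1) + ... + 2 rho^2(l)) / N
    var = (1.0 + 2.0 * np.sum(rho_hat[1:l + 1] ** 2)) / n
    return 2.0 * np.sqrt(var)

# crude simulation of X_t = 0.9 X_{t-1} + eps_t (no burn-in, X_0 = 0)
rng = np.random.default_rng(0)
x = np.zeros(400)
for t in range(1, 400):
    x[t] = 0.9 * x[t - 1] + rng.standard_normal()

rho = sample_acf(x, 30)
print(rho[:5])                        # compare with the theoretical values 0.9**k
print(ma_acf_bounds(rho, 0, len(x)))  # white-noise bound 2/sqrt(N) = 0.1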
Sample PACF. Properties of the sample PACF are much harder to establish.
For certain, φ̂(l) is a consistent (and therefore asymptotically unbiased) estimate
for the partial autocorrelation function φ(l), just because the sample ACF is a
consistent estimate for the autocorrelation function. However, its distribution is
really hard to find.
As we can guess, properties of φ̂(l) are especially important to us when Xt is an AR(k) process. Indeed, φ(l) = 0 for all l > k for such processes, and this property is a defining characteristic of the AR(k) model. In this particular case, it could be shown that, if l > k, then

(3.4)    E φ̂(l) ≈ 0.

If, in addition, the process is Gaussian, then

(3.5)    Var φ̂(l) ≈ 1/N.

Moreover, it could be shown that the φ̂(l), l > k, are approximately independent and their distribution is approximately normal.
Choice of the order of the autoregression model. The above results may help us to choose the order of the autoregression model by examining the sample PACF. Indeed, if the order of the autoregression is k, then the random variables φ̂(l), l > k, are (approximately) i.i.d. normal random variables with zero expectation and variance 1/N. For each of them, the probability P{|φ̂(l)| > 2/√N} is approximately .05. Choose some number M, for instance M = 25, and consider all l, k < l ≤ k + M, such that |φ̂(l)| > 2/√N. The number of such l's has a binomial distribution with parameters .05 and M, and its distribution can be calculated. For instance,

(3.6)    P{no more than three of the values |φ̂(l)|, l = k + 1, . . . , k + 25, exceed 2/√N} ≈ .96

So, the decision rule could be as follows: look for the smallest k such that |φ̂(k + 1)| < 2/√N and the condition in (3.6) (or similar) is satisfied. This could be a reasonable first guess.
For example, suppose the first values of the sample PACF are equal to φ̂(1) = 0.7, φ̂(2) = −0.4, φ̂(3) = −0.5, φ̂(4) = 0.2, and all other values do not exceed 0.1. Suppose also that the sample size N = 400. Then 2/√N = 0.1. As far as the condition in (3.6) is concerned, we could take k = 1. However, the next three values are all greater than 0.1 in absolute value, so the condition |φ̂(k + 1)| < 2/√N fails; for that reason, we have to advance to k = 4, so that the next value φ̂(5) is less than 0.1. If, however, φ̂(2) = −0.02, we would be in a difficult position. Formally, the decision rule suggests k = 1. However, φ̂(3) and φ̂(4) deviate from zero by ten and four standard deviations, respectively. It is therefore quite reasonable to go for k = 4 nonetheless.
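The decision rule above is mechanical enough to be coded. Below is a rough Python sketch (ours; suggest_ar_order is a made-up name, and the "no more than three exceedances" condition of (3.6) is parameterized by max_exceed). It reproduces the answer k = 4 for the worked example.

import numpy as np

def suggest_ar_order(pacf_hat, n, M=25, max_exceed=3):
    bound = 2.0 / np.sqrt(n)
    for k in range(len(pacf_hat)):
        # pacf_hat[k] is phi_hat(k+1); first condition: |phi_hat(k+1)| < 2/sqrt(N)
        if abs(pacf_hat[k]) >= bound:
            continue
        # condition of type (3.6): few of phi_hat(k+1), ..., phi_hat(k+M) exceed the bound
        window = pacf_hat[k:k + M]
        if np.sum(np.abs(window) > bound) <= max_exceed:
            return k
    return None

# worked example from the text: N = 400, so 2/sqrt(N) = 0.1
pacf = np.array([0.7, -0.4, -0.5, 0.2] + [0.05] * 26)
print(suggest_ar_order(pacf, 400))   # -> 4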
Choice of the order of a moving average model. In a similar way, we can study MA(l) models. If the order of the moving average model is l, then the random variables ρ̂(k), k > l, have (approximately) normal distribution with zero expectation and variance

(1/N)(1 + 2ρ²(1) + · · · + 2ρ²(l))
Unfortunately, they are no longer independent, though the range of dependence is finite (it is 2l + 1). So, any references to the binomial distribution are no longer possible. Also, the values of the autocorrelation function are not known to us (and have to be replaced by their estimates). For these reasons, we arrive at the following decision rule. For each l = 0, 1, 2, . . . we compute the value

Cl = (2/√N) √(1 + 2ρ̂²(1) + · · · + 2ρ̂²(l)).

If the process is actually MA(l), then, for every k > l, we have P{|ρ̂(k)| > Cl} ≈ .05. So, we choose the smallest l such that |ρ̂(l + 1)| < Cl and “most” of the further values ρ̂(l + 2), ρ̂(l + 3), . . . do not exceed Cl in absolute value.
For example, suppose this time that the first values of the sample ACF are equal to ρ̂(1) = 0.7, ρ̂(2) = −0.4, ρ̂(3) = −0.5, ρ̂(4) = 0.2 and all the others do not exceed 0.1. Suppose also that the sample size N = 400. We start with l = 0. Then C0 = 2/√N = 0.1, and the first four values of the sample ACF exceed 0.1. So we move to l = 1. We compute C1 = 0.14 and compare the values ρ̂(k), k ≥ 2, with this bound. Still, the next three values of the sample ACF exceed 0.14 in absolute value, so we have to compute C2 = 0.146. We now compare the values ρ̂(k), k ≥ 3, with C2. Still, ρ̂(3) and ρ̂(4) exceed 0.146. Next, C3 = 0.155. Still, ρ̂(4) is greater than that, so we compute C4 = 0.156 and find no more values of the sample ACF that exceed this bound. So, l = 4 would be a first guess. If, for a change, N = 225 instead, then C0 = 0.13, C1 = 0.187, C2 = 0.195, C3 = 0.207 and l = 3 passes the criterion.
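A similar sketch for the moving average rule (again ours; reading “most” as "at most 5% of the remaining values" is our own interpretation of the text):

import numpy as np

def suggest_ma_order(acf_hat, n, max_order=10):
    # acf_hat[k-1] is rho_hat(k), k = 1, 2, ...
    for l in range(max_order + 1):
        c_l = (2.0 / np.sqrt(n)) * np.sqrt(1.0 + 2.0 * np.sum(np.array(acf_hat[:l]) ** 2))
        tail = np.abs(acf_hat[l:])
        # require |rho_hat(l+1)| < C_l and "most" further values within the bound
        if len(tail) and tail[0] < c_l and np.mean(tail > c_l) <= 0.05:
            return l
    return None

rho = [0.7, -0.4, -0.5, 0.2] + [0.05] * 26   # the example with N = 400
print(suggest_ma_order(rho, 400))            # -> 4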
Examples. 1. We begin with the Sunspots data (see Figure 4 in the Introduc-
tion). It is an annual data with the famous eleven years cycle. Its sample ACF and
sample PACF are shown on the Figures 1 and 2. We can’t suggest any reasonable
moving average model here, in particular because ACF values for lags 11, 22 and
so on are way too big. However, the graph of PACF suggests that autoregression
of the second order may work.
2. Another data set (Figure 15) represents a concentration of a certain chemi-
cal. Its sample ACF (Figure 16) does not contain any negative values, so the series
is probably non-stationary. Indeed, all the stationarity tests turn positive, and
when we look at the graph of the series, we feel confused. Could we consider the
change in the level as a very long cycle? Or it is a trend? However, its PACF (Fig-
ure 17) suggests that an autoregression of second order might work. Nonetheless, it
looks more reasonable to assume that the series is non-stationary, and switch to the
increments instead. For this new series, shown on Figure 18, its ACF (Figure 19)
and PACF (Figure 20) suggest moving average of the order 1 or autoregression of
the order 6 as reasonable guesses. So, we could try ARIMA(6,1,0) or ARIMA(0,1,1)
for the original series.
3. IBM stock market data revisited (Chapter 1, Section 4). The series (Figure
I.33) is definitely non-stationary, and we should switch to the increments (Chapter
1, Figure 34). The increments look stationary. However, as we have pointed out
in Section I.4, the series probably changes its behavior somewhere around point
235. Indeed, if we estimate ACF and PACF over the interval [1, 235] (see Figures
21 and 22), then autoregression of the order 2 or moving average of the order
1 look reasonable. However, sample ACF estimated over the rest of the series
Figure 15. Concentration data. The series is hardly stationary.

Figure 16. Sample ACF of the concentration data. Looks like ACF for a non-stationary series.

Figure 17. Sample PACF of the concentration data. Could AR(2) work?

looks different (Figure 23, white noise?). If we disregard the change of behavior
and estimate the ACF for the whole series (Figure 24), we may arrive at wrong
conclusions (white noise for the whole series is not working).
Details. Justification of (3.1) and (3.2) (sketch). Denote Mk = E R̂(k)
and δk = R̂(k) − Mk , so that Eδk = 0. By (2.6), the variance of R̂(k) goes to zero
Figure 18. Increments of the concentration data. This one is stationary for sure.

Figure 19. Sample ACF of the increments of the concentration data. MA(1)?

Figure 20. Sample PACF of the increments of the concentration data. AR(6)?

as N → ∞, hence δk → 0 as N → ∞ as well. So, if N is large enough,


(3.7)    ρ̂(k) = R̂(k)/R̂(0) = (Mk + δk)/(M0 + δ0)
              = ((Mk + δk)/M0) (1 − δ0/M0 + δ0²/M0² − · · ·)
              = Mk/M0 + δk/M0 − δ0 Mk/M0² + higher order terms
Figure 21. Sample ACF of the increments of the IBM data estimated over the first 235 points. MA(1)?

Figure 22. Sample PACF of the increments of the IBM data estimated over the first 235 points. AR(2)?

Figure 23. Sample ACF of the increments of the IBM data estimated over the rest of the series. White noise?

Dropping the higher order terms, we get

E ρ̂(k) ≈ Mk/M0 + E(δk)/M0 − E(δ0) Mk/M0² = Mk/M0 = E R̂(k) / E R̂(0)
Figure 24. Sample ACF of the increments of the IBM data estimated over the whole series. Looks like white noise (though it’s not working).

Substituting the expectations from (2.4) and dropping the higher order terms there, we get (3.1). In order to find the variance, we use (3.7), once again dropping the
higher order terms. We have


Var ρ̂(k) ≈ Var(δk)/M0² + Mk² Var(δ0)/M0⁴ − 2 Mk Cov(δ0, δk)/M0³
Now, Var(δk) = Var R̂(k) and Cov(δ0, δk) = Cov(R̂(k), R̂(0)). So, using (2.4), (2.5),
(2.6) and (2.7) (and therefore assuming that Xt is Gaussian), we arrive at (3.2). In
a similar way, we can establish (3.3).

Exercises
1. Estimated values φ̂(1), . . . , φ̂(30) of sample PACF are given (will be posted
on the web). Following the procedure described in Section 3.3, make a suggestion
about the order of an autoregression model AR(k) assuming that (a) N = 100 (b)
N = 400 (c) N = 900 (d) N = 1600. Explain your decision.
2. Estimated values ρ̂(1), . . . , ρ̂(30) of sample ACF are given (will be posted
on the web). Following the procedure described in Section 3.3, make a suggestion
about the order of a moving average model MA(l) assuming that (a) N = 100 (b)
N = 400 (c) N = 900 (d) N = 1600. Explain your decision.

4. Estimation of the Parameters of the Autoregression Model


How could we fit an autoregression model to the data X1 , . . . , XN ?
For the purpose of estimation, it is convenient to write down a general autore-
gression model of the order k with non-zero expectation as follows:
(4.1) (Xt − µ) = a1 (Xt−1 − µ) + · · · + ak (Xt−k − µ) + εt
where εt is a white noise with variance σε². (Compare with (II.6.1); we actually changed the signs of all of the coefficients ai.) Assume that the order k of the model has been selected. We therefore need to estimate the unknown parameters µ, a1, a2, . . . , ak and σε². This can be done in different ways; we’ll briefly discuss five of them.
1. Exact least squares. The most obvious idea is to use the least squares, that
is to minimize
(4.2)    ∑_t ((Xt − µ) − a1(Xt−1 − µ) − · · · − ak(Xt−k − µ))²
116

In order to find the minimum, we have to find the partial derivatives with respect
to the parameters and set them to zero. Unfortunately, the model is not linear
with respect to the parameters µ, a1 , . . . , ak which leads to complicated equations.
It could be made linear if we rewrite the equation (4.1) as
(4.3) Xt = a0 + a1 Xt−1 + · · · + ak Xt−k + εt
where
a0 = µ(1 − a1 − a2 − · · · − ak ).
Then the expression (4.2) becomes
(4.4)    ∑_t (Xt − a0 − a1 Xt−1 − · · · − ak Xt−k)²
Setting partial derivatives to zero, we get linear equations for a0 , . . . , ak . Solving
for them, we get the estimates â0 , . . . , âk . Finally, we set
µ̂ = â0 / (1 − â1 − · · · − âk)
This way we get simple expressions for a0 , . . . , ak and a really ugly looking estimate
for the expectation µ.
2. Approximate least squares. This is a modification of the previous method.
We, first, estimate the mean, that is replace µ by X̄ and apply least squares after
that. In other words, we minimize the sum
∑_t ((Xt − X̄) − a1(Xt−1 − X̄) − · · · − ak(Xt−k − X̄))².
Once again, we are getting linear equations here.
3. Yule—Walker estimates. As we have seen, the values of the autocorrela-
tion function must satisfy the Yule—Walker equations. Taking into account that
ρ(0) = 1, ρ(i) = ρ(−i), we are getting k equations which relate a1 , . . . , ak and
ρ(1), . . . , ρ(k):
(4.5) ρ(t) − a1 ρ(t − 1) − · · · − ak ρ(t − k) = 0, t = 1, 2, . . . , k
(Compare with (II.6.6). Since the equation (II.6.1) has been replaced by (4.1), we
have negative signs in (4.5).) Replacing the values of an autocorrelation function
by their estimates ρ̂(1), . . . , ρ̂(k), and solving them for a1 , . . . , ak , we are getting
the Yule—Walker estimates for the coefficients a1 , . . . , ak . This method does not
provide us with the estimate for the mean, we have to estimate it separately (for
instance, using the sample mean X̄).
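For illustration, here is a small Python sketch of the Yule—Walker method (the function name yule_walker and the numerical ACF values are ours; the values are chosen close to the theoretical ACF of the AR(2) process Xt = 1.8Xt−1 − 0.9Xt−2 + εt, so the solution should come out near (1.8, −0.9)):

import numpy as np

def yule_walker(rho_hat, k):
    # rho_hat[j] = rho_hat(j), with rho_hat[0] = 1; solve (4.5) for a_1, ..., a_k
    R = np.array([[rho_hat[abs(i - j)] for j in range(k)] for i in range(k)])
    r = np.array([rho_hat[j + 1] for j in range(k)])
    return np.linalg.solve(R, r)     # coefficients in the form (4.1)

rho = [1.0, 0.947, 0.806, 0.62]      # hypothetical sample ACF values
print(yule_walker(rho, 2))           # approximately [1.78, -0.88]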
4. Conditional maximum likelihood method. We compute a conditional density
of Xk+1 , . . . XN given X1 , . . . , Xk , plug in the observations and maximize the ex-
pression with respect to unknown parameters. For the sake of brevity, we assume
that k = 1 (first order autoregression).
In order to speak about likelihood, we have to make assumptions about the
distribution of the process. So, we assume that the process is Gaussian, that is the
noise εt has normal distribution with zero expectation and the variance σε2 . It is
convenient to use the (4.3) form of the model, which becomes
Xt = a0 + a1 Xt−1 + εt
Random variables
εt = Xt − a0 − a1 Xt−1 , t = 2, . . . , N
117

are i.i.d. normal variables and their joint distribution is equal to

(4.6)    fε2,...,εN(y2, . . . , yN) = (1/((2π)^{(N−1)/2} σε^{N−1})) exp{−(y2² + · · · + yN²)/(2σε²)}
Since the values of the noise do not depend on the past values of the process,
(4.6) also represents the conditional density of ε2 , . . . , εN given X1 . However, if
X1 is known to us, then X2 , . . . , XN could be obtained from ε2 , . . . , εN by a linear
transformation. This way, we eventually arrive at the following expression for the
conditional density of X2 , . . . , Xn given X1 :
(4.7)    fX2,...,XN|X1=x1(x2, . . . , xN) = (1/((2π)^{(N−1)/2} σε^{N−1})) exp{−(1/(2σε²)) [(x2 − a0 − a1 x1)² + · · · + (xN − a0 − a1 xN−1)²]}

(just a modification of (4.6); namely, we have replaced yl by xl − a0 − a1 xl−1).


In order to find the (conditional) maximum likelihood estimates, we have to
substitute the observations (that is, replace the remaining xl ’s by the data Xl
and maximize the resulting expression with respect to the unknown parameters).
Taking the logarithm, we get the following log-likelihood function
(4.8)    L(a0, a1, σε² | X1, . . . , XN) = −((N − 1)/2) log(2π) − (N − 1) log σε − (1/(2σε²)) Q(a0, a1)
where
(4.9) Q(a0 , a1 ) = [(X2 − a0 − a1 X1 )2 + · · · + (XN − a0 − a1 XN −1 )2 ]
We can first minimize Q(a0 , a1 ) with respect to a0 , a1 , and after that, find the
maximum in (4.8) with respect to σε2 . However, if we compare (4.9) and (4.4), we
see that the conditional maximum likelihood method is, in fact, equivalent to the
exact least squares estimates (though it also provides an estimate for the variance σε²).
5. Exact maximum likelihood estimates. Again, we limit our discussion to the autoregression of the first order. To simplify things even further, we suppose that
the expectation of the processes is known to us and is equal to zero (which means
that a0 = 0). The equation (4.1) becomes
Xt = aXt−1 + εt
Once again, we have to assume that the process is Gaussian. We have already found
the conditional density of X2 , . . . , Xn given X1 . In our special case, it is equal to
(4.10)    fX2...XN|X1=x1(x2, . . . , xN) = (1/((2π)^{(N−1)/2} σε^{N−1})) exp{−(1/(2σε²)) [(x2 − a x1)² + · · · + (xN − a xN−1)²]}

(we have dropped a0 and replaced a1 by a). On the other hand, the density of X1
is also known to us. Namely, X1 has normal distribution with zero expectation and variance σX² = σε²/(1 − a²), and its density is therefore equal to

(4.11)    fX1(x1) = (√(1 − a²)/(√(2π) σε)) exp{−((1 − a²)/(2σε²)) x1²}
118

We have to multiply the conditional density (4.10) by the density (4.11) and plug
in the data. We get the following expression:

(√(1 − a²)/((2π)^{N/2} σε^N)) exp{−(1/(2σε²)) Q(a)}

where
Q(a) = [(1 − a2 )X12 + (X2 − aX1 )2 + · · · + (XN − aXN −1 )2 ]

We need to maximize the resulting expression with respect to the unknown param-
eters. Taking the logarithm, we get the following log likelihood function:

L(a, σε²) = (1/2) log(1 − a²) − (N/2) log(2π) − N log σε − (1/(2σε²)) Q(a)

Note that estimation of a can no longer be separated from estimation of the variance
σε2 . So, in order to find the maximum, we have to use some numerical algorithms
(we can’t just set the partial derivatives to zero).
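As a rough illustration of the last point, the following Python sketch (ours, not the author's code) maximizes the exact log likelihood L(a, σε²) above numerically for a simulated zero-mean AR(1) series, using a general-purpose optimizer instead of closed-form equations:

import numpy as np
from scipy.optimize import minimize

def neg_log_lik(params, x):
    a, log_sigma = params
    sigma = np.exp(log_sigma)            # keep sigma > 0
    if abs(a) >= 1.0:
        return np.inf                    # stationarity constraint
    n = len(x)
    Q = (1 - a**2) * x[0]**2 + np.sum((x[1:] - a * x[:-1])**2)
    return -(0.5 * np.log(1 - a**2) - 0.5 * n * np.log(2 * np.pi)
             - n * np.log(sigma) - Q / (2 * sigma**2))

rng = np.random.default_rng(1)
x = np.zeros(500)
for t in range(1, 500):
    x[t] = 0.7 * x[t - 1] + rng.standard_normal()

res = minimize(neg_log_lik, x0=[0.0, 0.0], args=(x,), method="Nelder-Mead")
a_hat, sigma2_hat = res.x[0], np.exp(res.x[1])**2
print(a_hat, sigma2_hat)                 # near 0.7 and 1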
Further comments. Asymptotically, all the methods are equivalent. Nowadays the standard method is the exact maximum likelihood. The other methods provide ‘rough’ estimates for the parameters, which can be used as initial values for the iterations. Also, they could be used for diagnostics.

Exercises

1. Given values of the sample ACF (will be posted on the web), find the
Yule—Walker estimates for the coefficients of AR(2) model.

5. Estimation of Parameters of Moving Average and ARMA models


We will limit our discussion to the simplest possible situation: we assume that the expectation is equal to zero, and the order of the moving average is equal to 1.
The model therefore becomes

(5.1) Xt = εt − bεt−1

where b is the unknown parameter, |b| < 1, and εt is the Gaussian white noise with
unknown variance σε2 .
All methods of estimation of the parameters of the model are somehow related
to the maximum likelihood method.
Assume that the process is Gaussian. Then X1 , . . . , XN have a multivariate
Gaussian distribution with zero expectation and known covariance matrix of a sim-
ple structure. So, the joint density of X1 , . . . , XN can be written and the maximum
likelihood method could be used, at least in theory. However, this straightforward
application of the maximum likelihood did not prove to be fruitful (computational
problems, especially in case of a general model, are very hard; forecasting is diffi-
cult as well). We’ll discuss another two modifications of the maximum likelihood
method.
119

From (5.1), we have


ε1 = X1 + bε0
ε2 = X2 + bε1
= X2 + bX1 + b2 ε0
...
εN = XN + bXN −1 + · · · + bN −1 X1 + bN ε0
Similar to what was done for the autoregression model, we write down the joint density of ε1, . . . , εN, note that it coincides with the conditional density of ε1, . . . , εN given ε0 and, finally, write down the expression for the joint density of X1, . . . , XN and ε0. It has the form

(5.2)    fX1...XN,ε0(x1, . . . , xN, ε0) = (1/((2π)^{N/2} σε^N)) exp{−(1/(2σε²)) Q(b)}

where

Q(b) = ε0² + (X1 + bε0)² + (X2 + bX1 + b²ε0)² + · · · + (XN + bXN−1 + · · · + b^{N−1}X1 + b^N ε0)²
Very unfortunately, Q(b) depends on the unknown ε0 . We have two options.
1. Conditional likelihood. We set ε0 = 0. We can minimize Q(b) in b (note that
it is a polynomial of degree 2(N − 1)). After that, we can easily find an estimate
for σε2 .
2. Exact Likelihood. Since ε0 is not known to us, we treat ε0 as an unknown
parameter. Given b, Q(b) is quadratic in ε0 and could be easily minimized. As
a result, we get ε0 as a function of X1 , . . . , XN and b and plug it back into the
expression for Q(b). After that, we minimize in b, and, finally, estimate the variance
σε2 .
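A minimal sketch of the conditional likelihood option for MA(1) (ours; it evaluates Q(b) through the residual recursion εt = Xt + bεt−1 with ε0 = 0 and minimizes over a grid of b values, which is crude but enough to see the idea):

import numpy as np

def conditional_Q(b, x):
    eps, total = 0.0, 0.0
    for xt in x:
        eps = xt + b * eps        # residual recursion implied by (5.1), eps_0 = 0
        total += eps**2
    return total

def fit_ma1(x):
    grid = np.linspace(-0.99, 0.99, 399)
    Q = np.array([conditional_Q(b, x) for b in grid])
    b_hat = grid[np.argmin(Q)]
    sigma2_hat = Q.min() / len(x)    # estimate of the noise variance
    return b_hat, sigma2_hat

rng = np.random.default_rng(2)
eps = rng.standard_normal(501)
x = eps[1:] - 0.6 * eps[:-1]         # X_t = eps_t - b*eps_{t-1} with b = 0.6
print(fit_ma1(x))                    # b_hat should be near 0.6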
Mixed ARMA models can be treated in a similar way. We outline the method in
the simplest situation (ARMA(1,1) model with zero expectation). It is convenient
for us to write down the model as
(5.3) Xt + aXt−1 = εt − bεt−1
We have
ε1 = X1 + aX0 + bε0
ε2 = X2 + aX1 + bε1 =
= X2 + (a + b)X1 + abX0 + b2 ε0
ε3 = X3 + aX2 + bε2 =
= X3 + (a + b)X2 + (a + b)bX1 + ab2 X0 + b3 ε0
...
εN = XN + (a + b)XN −1 + · · · + (a + b)bN −2 X1 + abN −1 X0 + bN ε0
As above, we write down a joint density of ε1 , . . . , εN , notice that it coincides with
the conditional density of ε1 , . . . , εN given ε0 and X0 , and finally, find a conditional
density of X1 , . . . , XN given ε0 and X0 . It still has the form
(5.4)    fX1...XN|X0,ε0(x1, . . . , xN) = (1/((2π)^{N/2} σε^N)) exp{−(1/(2σε²)) Q(a, b, X0, ε0)}
120

with

(5.5)    Q(a, b, X0, ε0) = (X1 + aX0 + bε0)² + (X2 + (a + b)X1 + abX0 + b²ε0)² + (X3 + (a + b)X2 + (a + b)bX1 + ab²X0 + b³ε0)² + · · · + (XN + (a + b)XN−1 + · · · + (a + b)b^{N−2}X1 + ab^{N−1}X0 + b^N ε0)²
As above, we can either set X0 and ε0 to zero (conditional likelihood), or write
down the joint density of X0 , X1 , . . . , XN , ε0 and treat X0 and ε0 as unknown
parameters (preferable). In the latter case, Q is quadratic in X0 and ε0 , so we can
find them as functions of the observations X1 , . . . , XN and the parameters a, b.

6. Distribution of the Estimates and Confidence Regions*


Suppose we estimated the parameters of an ARMA(k, l) model by means of
the maximum likelihood (that is, by means of some software that implements the
maximum likelihood approach). In order to construct confidence intervals for the
parameters, we need to know the distribution of the estimates, at least we need
to know their variances. The answer to this question comes from the theory of
maximum likelihood estimation.
Denote by c = (c1 , . . . , cq ) the vector of parameters of the model (say, for
MA(1), it is the expectation m and the parameter b). Let L(c) be the log likelihood
function and ĉ be the maximum likelihood estimate for c. Denote
Lij = (∂²L(c)/∂ci∂cj)|_{c=ĉ},
the second order partial derivative of the log likelihood function evaluated at ĉ. It
follows from the theory of maximum likelihood estimation that the variance-covariance matrix for ĉ = (ĉ1, . . . , ĉq) is approximately equal to {−Lij}^{−1}, that is, to the inverse of the matrix with elements −Lij. In particular, this gives us the variances of the estimates ĉi as the diagonal elements of {−Lij}^{−1}. Assuming in addition that the ĉi are asymptotically normal, we can construct approximate confidence intervals for the parameters ci as ĉi plus or minus two standard deviations.
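In practice the matrix {−Lij} can be approximated by numerical second differences of the log likelihood around ĉ. A sketch (ours; loglik stands for whatever log likelihood function was maximized):

import numpy as np

def neg_hessian(loglik, c_hat, h=1e-4):
    q = len(c_hat)
    H = np.zeros((q, q))
    for i in range(q):
        for j in range(q):
            e_i, e_j = np.zeros(q), np.zeros(q)
            e_i[i], e_j[j] = h, h
            # central second difference for d^2 L / dc_i dc_j at c_hat
            H[i, j] = (loglik(c_hat + e_i + e_j) - loglik(c_hat + e_i - e_j)
                       - loglik(c_hat - e_i + e_j) + loglik(c_hat - e_i - e_j)) / (4 * h * h)
    return -H

# usage sketch: cov_hat = np.linalg.inv(neg_hessian(loglik, c_hat));
# standard errors are np.sqrt(np.diag(cov_hat)), and c_hat_i +/- 2*se gives the intervals.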
However, the estimates are likely to be highly correlated, which makes testing
of statistical hypotheses difficult. For that reason, we would like to construct a
confidence region for the whole vector c instead of a collection of confidence intervals
for each of the parameters.
To outline the main ideas, we begin with a simplest possible case. Let X1 , . . . , Xn
be a sample from a normal distribution with unknown parameters m, σ 2 , and let X̄
and s2 be the sample mean and the sample variance. It is well known (check basic
statistical courses) that s2 and X̄ are independent, X̄ has normal distribution with
parameters m, σ 2 /n and (n − 1)s2 /σ 2 has a χ2 distribution with n − 1 degrees of
freedom.
Denote

Q(a) = ∑_{i=1}^{n} (Xi − a)²
In terms of Q, the sample variance could be written as s2 = Q(X̄)/(n − 1). We
begin with the identity
(6.1) Q(m) = Q(X̄) + n(X̄ − m)2 = (n − 1)s2 + n(X̄ − m)2
121

Dividing (6.1) by σ 2 , we have


Q(m)/σ² = (n − 1)s²/σ² + n(X̄ − m)²/σ²
On the left, we have a sum of squares of n standard normal random variables (Xi − m)/σ, that is, the sum of n independent χ²(1) random variables. So, the left side is a χ²(n)-distributed random variable. On the right, the first term has a χ²(n − 1) distribution and the second is the square of a standard normal random variable √n(X̄ − m)/σ, that is, another χ²(1). Moreover, the terms on the right are independent.
For this reason,
n(X̄ − m)²/s² = (Q(m) − Q(X̄)) / (Q(X̄)/(n − 1))
has F distribution with 1, n − 1 degrees of freedom. Therefore, with probability
1 − α,
(6.2)    (Q(m) − Q(X̄)) / (Q(X̄)/(n − 1)) ≤ Fα(1, n − 1)
where Fα (1, n − 1) is the corresponding percentile for the F distribution. Solving
for Q(m), we get
Q(m) ≤ Q(X̄)(1 + (1/(n − 1)) Fα(1, n − 1))

Now, we could solve for m and get a confidence interval for the unknown mean m. However, it boils down to the classical confidence interval X̄ ± tα/2(n − 1)s/√n, in particular because the square of a t(k) random variable has the F(1, k) distribution.
Let us now consider a multiple linear regression model. Let Yi , i = 1, . . . , n and
Xij , i = 1, . . . , n, j = 1, . . . , q be related by the equation
Yi = c1 Xi1 + · · · + cq Xiq + εi , i = 1, . . . , n
where c1 , . . . , cq are the unknown parameters and εi are independent normal random
variables with zero expectation and unknown variance σε2 . As we know from (a little
bit more advanced) statistics, estimates for c1 , . . . , cq could be found by the method
of least squares (which is equivalent to the method of maximum likelihood). To
simplify the notation, let c = (c1 , . . . , cq ). Denote
(6.3)    Q(c) = ∑_{i=1}^{n} (Yi − c1 Xi1 − · · · − cq Xiq)²
Let ĉ be the point of the minimum of Q(c). The vector ĉ is the least squares
estimate for c. From the theory of the multiple linear regression, we know that
ĉ has a multivariate normal distribution and it is independent from the sum of the squares of the residuals Q(ĉ). Moreover, Q(ĉ)/(n − q) is an unbiased estimate for σε² and the quotient Q(ĉ)/σε² has a χ²(n − q) distribution. On the other hand, Q(c)/σε² = ∑i εi²/σε² has a χ²(n) distribution, and the difference Q(c) − Q(ĉ) is independent from the sum of the squares of the residuals Q(ĉ). We therefore have a decomposition

χ²(n) = Q(c)/σε² = (Q(c) − Q(ĉ))/σε² + Q(ĉ)/σε² = χ²(q) + χ²(n − q)
From here,
[(Q(c) − Q(ĉ))/q] / [Q(ĉ)/(n − q)]

has F distribution with q, n − q degrees of freedom. So, with probability (1 − α),


[(Q(c) − Q(ĉ))/q] / [Q(ĉ)/(n − q)] ≤ Fα(q, n − q).

Solving for Q(c), we get

(6.4)    Q(c) ≤ Q(ĉ)(1 + (q/(n − q)) Fα(q, n − q))
There is an alternative way to proceed. Note that σ̂ε2 = Q(ĉ)/(n − q) is an
estimate for the variance of the noise. If n is large, we may assume that σ̂ε2 ≈ σε2
and therefore
(Q(c) − Q(ĉ))/σ̂ε² = (Q(c) − Q(ĉ))/(Q(ĉ)/(n − q)) ≈ (Q(c) − Q(ĉ))/(Q(ĉ)/n)

has (approximately) a χ²(q) distribution. Using a percentile of the χ²(q) distribution and solving for Q(c), we get an approximate version of (6.4),

(6.5)    Q(c) ≤ Q(ĉ)(1 + (1/n) χ²α(q))

Since c is a vector and Q is a quadratic function, (6.4) (or (6.5)) defines an ellipsoid in the space of the parameters. It serves as a confidence region for the vector of parameters (c = (c1, . . . , cq) belongs to the region with probability 1 − α).
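As a small illustration (ours), checking whether a candidate parameter vector lies in the region (6.4) requires only the two sums of squares and an F percentile; the numbers below are hypothetical:

from scipy.stats import f as f_dist

def in_confidence_region(Q_c, Q_hat, q, n, alpha=0.05):
    # (6.4): Q(c) <= Q(c_hat) * (1 + q/(n - q) * F_alpha(q, n - q))
    F_alpha = f_dist.ppf(1.0 - alpha, q, n - q)   # upper alpha percentile
    return Q_c <= Q_hat * (1.0 + q / (n - q) * F_alpha)

# hypothetical example: q = 2 parameters, n = 100 residuals
print(in_confidence_region(Q_c=125.0, Q_hat=120.0, q=2, n=100))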
The above ideas could be applied to AR, MA and ARMA models. To do so, we
need a replacement for the residual sum of squares Q as a function of parameters
of the model. In case of AR(k) model we can, for instance, define Q(a) as
Q(a) = Q(a1, . . . , ak) = ∑_{t=k+1}^{N} (Xt + a0 + a1 Xt−1 + · · · + ak Xt−k)²

which is, essentially, the same as (6.3). In case of MA and ARMA models, defi-
nition of the function Q(a) is more complicated. Let us, for instance, consider an
ARMA(1,1) model with zero expectation, (like in Section 5), and Let Q(a, b, ε0 , X0 )
be the function defined by (5.5). It is quadratic in ε0 and X0 . Hence, we can easily
find a minimum
Q(a, b) = min Q(a, b, ε0 , X0 )
X0 ,ε0
which is now a function of a and b only. In a similar way, we can compute the sum
of the squares Q(a) for any ARMA model. The corresponding critical region could
be then constructed as in (6.4), where q should be the total number of unknown
parameters in the model (the order of autoregression plus the order of moving
average, plus one for the mean if the mean is not known), and n should be not the
number of observations N but the number of the residuals in the model (in most
cases, N minus the order of the autoregression).

7. Comparison of Models
Which model should be used for a given data set? Should we use an AR(k) or
MA(l) or a mixed model? How to choose k, l, that is how to choose a structure of
a model? There exist two ways to handle the problem. According to one of them,
we use a certain numerical criterion as an overall quality of a fitted model. We
then choose the model with the best score. According to the second approach, we
develop a concept of a “good” model. Our goal is to find a “good” model (there
123

could be several of them, or there could be none) and then, pick one of the “good”
ones.
Both approaches have advantages and disadvantages. With the first approach,
we don’t have to think: everything is left to the software. However, there is no reliable overall criterion, and all existing criteria have certain flaws. The second approach gives more flexibility but requires a certain level of expertise (you have to understand what you are doing). We begin with some formal criteria.
Residual Variance. Let q = k + l + 1 be the total number of parameters in the
model. As a first guess, we may consider a quotient Q(â)/(N − q) which, in case of
multiple linear regression, works as an unbiased estimator for the variance of the
noise σε2 . Unfortunately, it is well known in statistics that such overall criterion
over-parameterizes the model. The penalty for the use of extra parameters, hidden
in the denominator, is way too small.
Finite Prediction Error. For AR(k) models, we could use another function,

FPE(k) = ((N + k)/(N − k)) (R̂(0) + â1 R̂(1) + · · · + âk R̂(k)),

which is an unbiased estimate for the variance of the one-step-ahead prediction error.
Akaike Information criterion, or AIC. Formally, AIC is defined as
AIC = −2L(θ̂) + 2q
where L(θ̂) is the maximal value of the log likelihood function. Since the log
likelihood function for ARMA models is equivalent to Q(â), we see that AIC is
roughly equivalent to the Residual variance criterion plus some extra penalty 2q.
It is also roughly equivalent to the logarithm of the FPE criterion whenever both
of them are applicable.
Unfortunately, the AIC criterion also tends to over-parameterize the model.
Bayesian Information Criterion, or BIC. To cure the over-parametrization
problem related to the AIC and FPE criteria, another modification was suggested.
It is defined by the formula
BIC(q) = AIC(q) + q(log N − 1) + q log((1/q)(σ̂X²/σ̂ε² − 1))
This criterion is a bit better than the AIC (extra penalty term q(log N − 1) is
important, over-parametrization is not likely), but still ...
“Good model” approach. Roughly speaking, residuals in a “good” model
should be a white noise. At the same time, there should be no signs of over-
parametrization, so that no parameters can be dropped from the model.
A number of procedures could be used in order to check if the residuals are the
white noise. To begin with, we should check if the first values (like ρ(1), ρ(2) and
possibly ρ(3)) of the sample autocorrelation function for the residuals are significant.
The following test, called the Portmanteau lack-of-fit test, could also be used. Let
ρ̂(k) be the sample ACF for the residuals. If the ARMA(k, l) model is adequate,
then
N (ρ̂(1)2 + ρ̂(2)2 + · · · + ρ̂(m)2 )
has, approximately, a χ2 (m − k − l) distribution. A number m of the values used
in the test is at our disposal. It should not be comparable with the data length N .
Figure 25. Coal production data.

On the other hand, m − k − l should not be too small. In practice, k + l rarely exceeds five, and any m ≥ 20 could be used (the author’s personal favorite is 25).
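The test itself is a one-liner. A sketch (ours; the residual ACF values below are simulated, not taken from any of the data sets in this chapter):

import numpy as np
from scipy.stats import chi2

def portmanteau(resid_acf, n, k, l):
    # statistic N*(rho_hat(1)^2 + ... + rho_hat(m)^2), approximately chi^2(m - k - l)
    m = len(resid_acf)
    stat = n * np.sum(np.asarray(resid_acf) ** 2)
    p_value = chi2.sf(stat, df=m - k - l)
    return stat, p_value

# hypothetical residual ACF: m = 25 values from an AR(2) fit to N = 96 points
rng = np.random.default_rng(3)
resid_acf = rng.normal(0.0, 1.0 / np.sqrt(96), size=25)
print(portmanteau(resid_acf, n=96, k=2, l=0))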
As for the signs of over-parametrization, the p-values for the parameters should
not be too big. However, there is one problem. In most statistical packages, we con-
trol only k, d, l, and we can’t drop any intermediate coefficients from the equation.
For instance, if we consider an autoregression of the order 3, and if the coefficient
a1 looks insignificant, we can’t drop it from the model nonetheless (we simply have
no means to do that). So, we should look mostly at the p-values for the senior co-
efficients ak and bl . Also, some signs of over-parametrization may be seen from the
residual ACF. For instance, imagine that we are using the Yule—Walker estimates
in order to estimate the coefficients of an autoregression model (never do this for
real). It means that we, practically, set the first values of the residual ACF to zero.
The higher the order of the autoregression, the more values of the residual ACF are practically zeroes. Maximum likelihood estimates do not have this unpleasant
property. However, if first several values of the residual ACF are practically zeroes,
it could be a sign of over-parametrization.
Examples. 1. We begin with coal production data, 96 points, monthly, years
1953-1959 (Figure 25). Its ACF (Figure 26) does not suggest any reasonable MA
model, it slowly goes down and then takes big negative values after lag 15 or so.
Its PACF (Figure 27) suggests an AR(2) model. For an AR(2) model, we get
m̂ = 3.802, â1 = 0.4896 and â2 = 0.3308, all p-values are practically zeroes (the
biggest of them equals 0.0016). So, all the parameters are statistically significant.
In order to qualify for a ‘good’ model, the residuals should pass white noise test.
The value of Portmanteau test is only 12.03, with p-value 0.97, unbelievably good
(remember, it is a χ2 random variable with 23 degrees of freedom). So, the AR(2)
model is definitely good. To be on the safe side, we could try AR(3) and AR(1)
as well. For AR(3) model, we get â3 = 0.0918 with p-value 0.4054, so the senior
coefficient in the model is not statistically significant. For AR(1) model, we get all
the parameters significant (no surprise) and the value 21.44 for the Portmanteau
test, p-value 0.6124, still good. However, the first value of the ACF for the residuals
equals -0.2323 and the 95% confidence bound is 0.2, which means that the residuals
are probably not the white noise yet. So, we stick to the AR(2) model.
2. We consider now profit margin data set, 80 points, quarterly, years 1953-
1972 (Figure 28). Our first step is to study its ACF and PACF. Formally, its ACF
(Figure 29) suggests MA(3) model but a big chunk of negative values that starts
Figure 26. ACF for Coal production data. Moving average models are not likely.

Figure 27. PACF for Coal production data. AR(2)?

after the lag 10, some of them as big as −0.5, makes it questionable. So, we look
at the PACF instead (Figure 30). It clearly suggests AR(1) model. Trying an
AR(1) model, we get estimates m̂ = 4.699 and â1 = 0.876 with p-values that are
practically zeroes (no surprise). As for the residuals the Portmanteau test value
is 22.34, with p-value 0.5587, which looks really good. However, when we look at
ACF and PACF values for the residuals, we see a number of values that are pretty
big (ρ̂(4) = −0.2152, ϕ̂(4) = −0.2596, ϕ̂(8) = −0.2673, with 95% confidence bounds ±0.2191).
For that reason, we probably should try to add more parameters to the model.
However, an AR(2) model does not solve the problem. First of all, estimates
m̂ = 4.716, â1 = 1.026 and â2 = −0.173, and p-value for a2 is 0.1311, so the
senior coefficient in the model is not statistically significant. And, we still have
big values of ACF and PACF for the residuals, the value of Portmanteau test does
not significantly drop. So, we try a mixed model ARMA(1,1) instead. We get
m̂ = 4.714, â1 = 0.8281 and b̂1 = −0.2024, the biggest p-value is 0.007, so all the
coefficients of the model are statistically significant. The situation with ACF and
PACF for residuals improves somewhat, ρ̂(4) = −0.205 and ϕ̂(4) = −0.208 are now
within confidence bounds, the value of the Portmanteau test is 19.06 with p-value
0.6976. Only ϕ̂(8) = −0.2296 is slightly out of range.
So, we have two reasonably good models here, AR(1) and ARMA(1,1). In order
to decide which one to choose, we could drop the last few values, fit the model to
Figure 28. Profit Margin data.

Figure 29. ACF for Profit Margin data. Moving average models are not likely.

Figure 30. PACF for Profit Margin data. AR(1)?

the remaining part of the series and compare the prediction with actual data. If
we do so, we might like ARMA(1,1) better (Figures 31, 32).
3. We switch now to Parts Availability data, 100 points, weekly, years 1971-
1972 (Figure 33). This time, both Kendall and Spearman tests turn positive even
at 99% level of confidence, so the series is probably not stationary. However, if
we look at the graph of the series, we may consider it as a stationary series with
long-periodic cycle. Indeed, Kendall and Spearman tests are designed for a trend-
plus-white-noise data type and they can be easily confused in case of long-periodic
cycles. So, we should probably try to find an ARMA model for the original data
Figure 31. Profit Margin data. Prediction according to AR(1) model.

Figure 32. Profit Margin data. Prediction according to ARMA(1,1) model looks slightly better.

Figure 33. Parts availability data. Possibly, a non-stationary series, Kendall and Spearman tests positive.

as well as try to find an ARMA model for the increments, so that it would be an
ARIMA model for the original series.
We begin with the series itself. Its ACF (Figure 34) does not suggest any
reasonable MA model. However, its PACF (Figure 35) suggests an AR(3) model.
Indeed, this way we get a ‘good’ model, m̂ = 8.229 with p-value practically zero and
â1 = 0.1505, â2 = 0.2416, â3 = 0.3518 with p-values 0.1739, 0.0348 and 0.0024 re-
spectively. So, all the coefficients of the model except a1 are statistically significant.
Portmanteau test value is 14.16 with p-value 0.8956. Naturally, we should try other
Figure 34. ACF for Parts Availability data. Moving average models are not likely.

Figure 35. PACF for Parts Availability data. AR(3) might work.

models. Surprisingly, an AR(2) model is also good—this time, all the parameters
are statistically significant and the Portmanteau test value is 19.01 with p-value
0.7008. For an AR(4) model, the senior coefficient a4 is statistically insignificant.
If we try ARMA(3,1) model instead, then all three autoregression coefficients are
statistically insignificant, a sure sign of wrong parametrization.
We now switch to the increments (Figure 36). Its ACF (Figure 37) suggests
MA(1) model. If we look at the PACF instead (Figure 38), it clearly suggests
AR(2) model. Trying an MA(1) model, we get b̂1 = 0.7249 with p-value that is
practically zero, and the value of the Portmanteau test is 12.23, with p-value 0.9722,
unbelievably good. So, it is an ARIMA(0,1,1) model for the original data set. For
MA(2) model, we get the senior coefficient statistically insignificant.
For an AR(2) model, we get â1 = −0.7646 and â2 = −0.4401, both p-values are
practically zeroes. The value of the Portmanteau test is 14.5 with p-value 0.9177,
still great. So it is an ARIMA(2,1,0) model for the original series. For AR(3)
model, the senior coefficient is insignificant, for AR(1) the residuals fail the white
noise test. As a result, we get four ‘good’ models — AR(3), AR(2), ARIMA(2,1,0)
and ARIMA(0,1,1). To choose one out of them, we drop six points from the series,
construct a prediction and compare it with actual data set. The results are shown
on Figures 39, 40, 41 and 42. The ARIMA(2,1,0) prediction looks slightly better.

4. Next, we revisit the concentration data (Example 2 on p.111, Figure 15). This series is also on the margin of stationarity—should we consider
Figure 36. Increments of Parts availability data.

Figure 37. ACF for the increments of Parts Availability data. MA(1) might work.

Figure 38. PACF for the increments of Parts Availability data. AR(2) might work.

the change of the level as a trend, or is it a long-periodic-cycle type of behavior? The first thirty values of its ACF (Figure 16) are non-negative, which might be evidence of a trend. Kendall and Spearman tests are also positive. However, its
PACF (Figure 17) suggests an AR(2) model. Indeed, an autoregression process of
second order may contain a long periodic cycle. Trying an AR(2) model, we get an
estimate for the mean m̂ = 17.06 with p-value that is practically zero (no surprise)
and estimates â1 = 0.4263 and â2 = 0.2576, both p-values are practically zeroes.
So, all the parameters of the model are statistically significant, and the value of the
Figure 39. Parts Availability data. Prediction according to AR(3) model.

Figure 40. Parts Availability data. Prediction according to AR(2) model.

Figure 41. Parts Availability data. Prediction according to ARIMA(2,1,0) model.

Portmanteau test is 26.97 which is quite alright for a χ2 distribution with 22 degrees
of freedom (p-value 0.2573). So, it is a ‘good’ model. If we try an AR(1) model
instead, all parameters are statistically significant but the value of the Portmanteau
test jumps up to 47.44, p-value 0.003. Therefore, the Portmanteau test indicates
that the residuals have non-trivial correlations, so this is not a ‘good’ model. No surprise—cyclic behavior is related to complex roots, and complex roots may show
up if the order of autoregression is at least 2. For AR(3) model, the value of the
Figure 42. Parts Availability data. Prediction according to ARIMA(0,1,1) model.

Figure 43. Price Index data.

Portmanteau test is 26.14 with p-value 0.2456, practically the same as for AR(2)
model. However, the estimate â3 = 0.0808 has p-value 0.2719, and that is the senior
coefficient. So, we have clear evidence of over-parametrization.
If instead we treat the data as a non-stationary series, we switch to the incre-
ments of the series (Figure 18). Looking at the ACF for the increments (Figure 19),
we may want to try an MA(1) model. Its PACF (Figure 20) suggests that AR(6) might
work. For a MA(1) model, we get b̂1 = 0.701 with p-value 0.0023, so the parameter
is statistically significant. As for the residuals, we get 29.14 for the Portmanteau
test, the p-value is 0.1879. So, we’ve got yet another good model ARIMA(0,1,1)
for the original series. The AR(6) model is also good, and so is AR(5) model.
However, AR(6) has a much better value of the Portmanteau test (p-value 0.7346
instead of 0.1864). So, we have three models to compare, AR(2), ARIMA(6,1,0) and
ARIMA(0,1,1). Out of them, ARIMA(6,1,0) produces somewhat better prediction.
5. To conclude, consider the Price index data (Figure 43). It is surely non-
stationary. Moreover, the magnitude of the oscillations looks proportional to the
level. For this reason, we take the logarithm (Figure 44), and after that, switch to
the increments (Figure 45).
Sample ACF of the increments (Figure 46) suggests MA(3) model. The model
looks fine, the ACF of the residuals (Figure 47) does not show any significant
correlations. However, the senior coefficient, b̂3 = 0.22, has a p-value of 0.157, which is way too big. So this coefficient can probably be dropped from the model.
Figure 44. Logarithm of the Price Index data.

Figure 45. Increments of the Logarithm of the Price Index data.

Figure 46. Sample ACF of the increments of the Logarithm of the Price Index data. MA(3)?

However, MA(2) model does not look good at all. Residuals show some non-
trivial correlations (at lags 3 and 8, see Figure 48), and the value of the Portmanteau
test with m = 25 is 54, which is way too much for the χ2 distribution with 23 degrees
of freedom (p-value 0.0003).
Looking at the PACF of the data instead (Figure 49), we may decide to try
autoregression of the order 4. However, the value of the Portmanteau test is 39.6,
the p-value is 0.008 and the residuals do not look like the white noise (Figure 50).
Looking at the graph of the PACF once again, we may decide to try autore-
gression of the order 8. The coefficients of the model look fine, the residuals do not
Figure 47. ACF of the residuals in MA(3) model. White noise, isn’t it?

Figure 48. ACF of the residuals in MA(2) model. Significant correlations for lags 3 and 8? Portmanteau test is a No.

Figure 49. Sample PACF of the increments of the Logarithm of the Price Index data. AR(4)?

show any significant correlations (Figure 51), the value of the Portmanteau test is
21 with p-value 0.202. But, eight parameters to estimate look too many.
Trying mixed models, we come across ARMA(2,1) model. Its residuals do not
show any significant correlations (Figure 52), the value 28 for the Portmanteau test
is reasonable (p-value is 0.18) and all the coefficients have reasonable standard errors
(autoregression coefficients are a1 = 0.79 with standard error 0.15 and a2 = −0.33
with standard error 0.06, and the moving average coefficient b1 = 0.84 with standard
error 0.06).
Figure 50. ACF of the residuals in AR(4) model. Still, significant correlations for lags 6 and 8. Portmanteau test is a No.

Figure 51. ACF of the residuals in AR(8) model. Finally?

Figure 52. ACF of the residuals in ARMA(2,1) model. Even better?

By the way, note that the Akaike Information Criterion (AIC) assigns the best
value to another model, ARMA(8,1). Since both AR(8) and ARMA(2,1) look good, adding extra parameters to them does not seem reasonable. Indeed, when we
look at the ACF for the residuals (Figure 53), we see that the first five or six values
are practically zeros, which may indicate that the model is over-parameterized.
As above, we can compare the predictions made according to the models.
They are shown on Figures 54, 55 and 56. It is not so easy to choose from, how-
ever the predictions according to ARIMA(2,1,1) model and ARIMA(8,1,1) model
Figure 53. ACF of the residuals in ARMA(8,1) model, best AIC value. Over-parametrization? First 5 values of ACF are practically zeroes.

Figure 54. Crops data. Prediction according to ARIMA(2,1,1) model fitted to the logarithm of the original series.

Figure 55. Same for ARIMA(8,1,0) model.

(the over-parameterized one) look slightly better than the prediction according to
ARIMA(8,1,0) model.

Figure 56. Same for ARIMA(8,1,1) model.

Exercises
1. An ARMA(3,1) model has been fitted to some data set. All coefficients of the model are significant. The values of the residual ACF will be posted on the
web. The sample size is N = 400. Are the residuals a white noise? What would
you say if N = 1600 instead?
CHAPTER 4

Spectrum and its Estimation

1. Spectral Density of a Stationary Process. Examples


Example. We begin with the following example. Let
(1.1)    Xt = ∑_{i=1}^{n} Ai cos(ωi t + θi)

where ωi are some constants, Ai are independent random variables with EA2i = Fi2
and θi are i.i.d., independent from all Aj and uniformly distributed on [0, 2π]. It
could be shown that the stochastic process Xt is strictly stationary. Let’s compute
its auto-covariance function. We have
R(k) = E Xt Xt+k
     = ∑_{i,j} E(Ai Aj) E(cos(ωi t + θi) cos(ωj(t + k) + θj))
     = ∑_{i,j} E(Ai Aj) (1/2) [E cos(ωi t + ωj(t + k) + θi + θj) + E cos(ωi t − ωj(t + k) + θi − θj)]
Now,
E cos(ωi t + ωj (t + k) + θi + θj ) = E cos(ωi t − ωj (t + k) + θi − θj ) = 0
if i ≠ j (why?). If i = j, then the first expectation is still zero and the second one
reduces to
E cos(ωi t − ωi (t + k) + θi − θi ) = cos(−ωi k) = cos(ωi k)
Hence, the autocovariance function turns out to be equal to
(1.2)    R(k) = (1/2) ∑_{i=1}^{n} Fi² cos(ωi k)
In particular, the total variance σX² = R(0) of the process can be decomposed into the sum of the variances of the corresponding harmonics:

σX² = (1/2) ∑_{i=1}^{n} Fi²

Switching to the autocorrelation function, we get

(1.3)    ρ(k) = (1/σX²) · (1/2) ∑_{i=1}^{n} Fi² cos(ωi k)
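The formula (1.2) is easy to confirm by simulation. In the sketch below (ours), the expectation E Xt Xt+k is approximated by averaging over many independent draws of the amplitudes Ai and phases θi; the particular frequencies and values Fi² are arbitrary choices:

import numpy as np

rng = np.random.default_rng(4)
omegas = np.array([0.3, 1.2])
F2 = np.array([1.0, 4.0])                 # F_i^2 = E A_i^2
k, t, reps = 5, 7, 200000

A = rng.normal(0.0, np.sqrt(F2), size=(reps, 2))        # E A_i^2 = F_i^2
theta = rng.uniform(0.0, 2 * np.pi, size=(reps, 2))     # i.i.d. uniform phases
x_t = np.sum(A * np.cos(omegas * t + theta), axis=1)
x_tk = np.sum(A * np.cos(omegas * (t + k) + theta), axis=1)

print(np.mean(x_t * x_tk))                       # Monte Carlo estimate of R(k)
print(0.5 * np.sum(F2 * np.cos(omegas * k)))     # the formula (1.2)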
In a sense, this example is not typical. For instance, the autocorrelation func-
tion ρ(k) does not tend to zero as k → ∞ (and that was the necessary condition
for most of the methods discussed earlier). Nonetheless, we’d like to have a way to
study a frequency structure of a stationary process. The following theorem holds.
Theorem (Wold’s representation). Suppose the autocovariance function R(k) of the process Xt satisfies the condition ∑_{k=1}^{∞} |R(k)| < ∞. Then there exists a non-negative continuous even function fX(ω), called the power spectral density, or just spectral density, of the process Xt, such that

(1.4)    R(k) = ∫_{−π}^{π} cos(ωk) fX(ω) dω

The function fX(ω) can be recovered from the autocovariance function R(k) by the formula

(1.5)    fX(ω) = (1/(2π)) [R(0) + 2 ∑_{k=1}^{∞} R(k) cos(ωk)]

The spectral density has the property

∫_{−π}^{π} fX(ω) dω = σX²
(just set k = 0 in (1.4)). We will drop the index X whenever it causes no confusion.
Sometimes, it is convenient to normalize the spectral density by dividing by the variance of the process. The normalized spectral density f∗X(ω) = (1/σX²) fX(ω) is related to the autocorrelation function:

(1.6)    ρ(k) = ∫_{−π}^{π} cos(ωk) f∗X(ω) dω,    f∗X(ω) = (1/(2π)) [1 + 2 ∑_{k=1}^{∞} ρ(k) cos(ωk)]

For the normalized spectral density, we have

∫_{−π}^{π} f∗(ω) dω = 1
We may consider (1.4) as a continuous version of (1.2) (instead of a finite (or countable) sum, we have an integral). In fact, (1.1) is an example of a stationary process with so-called discrete spectrum, and the above theorem describes processes with continuous spectra. There exists a version of the theorem that allows one to handle discrete and continuous spectra at the same time, but that requires the use of the
Lebesgue-Stieltjes integral.
Remarks. 1. Why do we limit ourselves to the interval [−π, π]? Let us return
to the example (1.1). Since
cos(ωk + φ) = cos((ω + 2π)k + φ)
for all integer k, it is clear that we could assume that all the frequencies ωi belong
to, say, the interval (−π, π] (since the observations are discrete, we can’t distinguish
oscillations with frequencies that differ by a multiple of 2π).
2. Alternative version of (1.4), (1.5) and (1.6). Recall that
cos(ωk) = (1/2)(e^{iωk} + e^{−iωk})

and

e^{iωk} = cos(ωk) + i sin(ωk)
Since the spectral density is an even function,

∫_{−π}^{π} sin(kω) fX(ω) dω = 0

for all k. Therefore

∫_{−π}^{π} cos(kω) fX(ω) dω = ∫_{−π}^{π} e^{ikω} fX(ω) dω

and the formula (1.4) can be rewritten as

(1.7)    R(k) = ∫_{−π}^{π} e^{ikω} fX(ω) dω.

The other way around, since R(k) = R(−k), we have

(1.8)    fX(ω) = (1/(2π)) [R(0) + 2 ∑_{k=1}^{∞} R(k) cos(ωk)]
              = (1/(2π)) [R(0) + ∑_{k=1}^{∞} R(k)(e^{iωk} + e^{−iωk})]
              = (1/(2π)) [R(0) + ∑_{k=1}^{∞} R(k)e^{iωk} + ∑_{k=1}^{∞} R(−k)e^{−iωk}]
              = (1/(2π)) ∑_{k=−∞}^{∞} R(k)e^{iωk}

In a similar way,

(1.9)    f∗X(ω) = (1/(2π)) ∑_{k=−∞}^{∞} ρ(k)e^{iωk}

and

(1.10)    ρ(k) = ∫_{−π}^{π} e^{ikω} f∗X(ω) dω.

Speaking the language of real and complex analysis, we see that the spectral
density is the Fourier transform of the autocovariance function.
3. Since the spectral density is an even function, it looks reasonable to consider
only non-negative frequencies. However, a concept of a spectral density can be
extended to the multivariate case when we observe two (or more) series and have
to work with cross-covariance function CXY (k) = Cov(Xt , Yt+k ) and so-called cross-
spectrum. Reduction to non-negative frequencies does not work there.
4. There exists a representation of a Gaussian stationary process with zero
expectation as a mixture of random harmonics, which could be considered as a
continuous version of the representation (1.1). It is based on the concept of a
stochastic integral. We don’t discuss it here.
Spectra of some stationary processes.
1. White noise. Suppose Xt are i.i.d. Then ρ(k) = 0 for all k ≠ 0 and (1.5) becomes

fX(ω) = (1/(2π)) σX²

So, the spectral density of the white noise is a constant, all frequencies participate
with the same magnitude (like a uniform mixture of all colors produces the white
color).
2. MA(1). Suppose now that
Xt = εt + bεt−1
where εt is the white noise and |b| < 1. We have
R(0) = (1 + b2 )σε2 , R(1) = R(−1) = bσε2
and R(k) = 0 if k = ±2, ±3, . . . . Therefore
(1.11)    fX(ω) = (1/(2π)) σε² (1 + 2b cos(ω) + b²)
3. AR(1). Suppose
Xt = aXt−1 + εt
where εt is the white noise and |a| < 1. It is easier to begin with the normalized
spectral density. As we know, ρ(k) = a^{|k|}. Hence

f∗X(ω) = (1/(2π)) (1 + 2 ∑_{k=1}^{∞} a^k cos(ωk))

In order to find a closed form, we use the alternative formula (1.9). We have

f∗X(ω) = (1/(2π)) ∑_{k=−∞}^{∞} a^{|k|} e^{iωk}
       = (1/(2π)) (1 + ∑_{k=1}^{∞} (ae^{iω})^k + ∑_{k=1}^{∞} (ae^{−iω})^k)
       = (1/(2π)) (1 + ae^{iω}/(1 − ae^{iω}) + ae^{−iω}/(1 − ae^{−iω}))
       = (1/(2π)) [(1 − ae^{iω})(1 − ae^{−iω}) + ae^{iω}(1 − ae^{−iω}) + ae^{−iω}(1 − ae^{iω})] / [(1 − ae^{iω})(1 − ae^{−iω})]
       = (1/(2π)) (1 − ae^{iω} − ae^{−iω} + a² + ae^{iω} − a² + ae^{−iω} − a²) / (1 − ae^{iω} − ae^{−iω} + a²)
       = (1/(2π)) (1 − a²) / (1 − 2a cos(ω) + a²)

Since

σX² = σε²/(1 − a²),

the non-normalized spectral density is given by

(1.12)    fX(ω) = (1/(2π)) σε² / (1 − 2a cos(ω) + a²)
This straightforward approach is not working for a general ARMA(k, l) process
(there is no simple formula for autocovariance function). We need to find another
way to do it, and the following general result is very helpful. Suppose that two
stationary processes Xt and Yt with zero expectation are related to each other by the equation

(1.13)    Yt = ∑_{n=−N}^{N} gn Xt−n

In terms of the shift operator B, the equation (1.13) can be rewritten as

Yt = G(B)Xt

where

G(z) = ∑_n gn z^n

It could be shown (see details below) that the spectral densities fX(ω) and fY(ω) are related to each other by the formula

(1.14)    fY(ω) = |G(e^{iω})|² fX(ω) = G(e^{iω}) \overline{G(e^{iω})} fX(ω) = G(e^{iω}) G(e^{−iω}) fX(ω)
(all expressions on the right are equal to each other).
Spectral density of the ARMA(k, l) process. Suppose Xt satisfies the
equation
Xt + a1 Xt−1 + · · · + ak Xt−k = εt + b1 εt−1 + · · · + bl εt−l
or, in terms of the shift operator,
α(B)Xt = β(B)εt
where, as usual, α(x) = 1 + a1 x + · · · + ak xk , β(x) = b0 + b1 x + · · · + bl xl and εt is
a white noise. Put
Yt = β(B)εt = εt + b1 εt−1 + · · · + bl εt−l
According to (1.14), we have
fY (ω) = |β(eiω )|2 fε (ω)
On the other hand,
Yt = α(B)Xt
and therefore
fY (ω) = |α(eiω )|2 fX (ω)
Hence

(1.15)    fX(ω) = (|β(e^{iω})|² / |α(e^{iω})|²) fε(ω) = (1/(2π)) σε² |β(e^{iω})|² / |α(e^{iω})|²

For instance, in case of MA(1), α(x) = 1, β(x) = 1 + bx and

(1.16)    |β(e^{iω})|² = |1 + b cos(ω) + ib sin(ω)|² = (1 + b cos ω)² + b² sin² ω = 1 + 2b cos ω + b²

which agrees with (1.11).
In case of AR(2) process
Xt + a1 Xt−1 + a2 Xt−2 = εt
Figure 1. Spectrum of a stationary series Xt = εt + 0.9εt−1.

Figure 2. Spectrum of a stationary series Xt = εt − 0.9εt−1.

we have α(z) = 1 + a1 z + a2 z² and therefore

α(e^{iω}) = 1 + a1 cos ω + a2 cos(2ω) + i(a1 sin ω + a2 sin(2ω))

and therefore

(1.17)    |α(e^{iω})|² = 1 + a1² + a2² + 2a1(1 + a2) cos ω + 2a2 cos(2ω)
(verify that).
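One way to verify (1.17), and to plot spectra like the ones below, is to evaluate |α(e^{iω})|² and the ratio (1.15) numerically. A sketch (ours; arma_spectrum assumes the conventions α(x) = 1 + a1x + · · · and β(x) = 1 + b1x + · · · used above, with b0 = 1 and σε² given):

import numpy as np

def arma_spectrum(omega, a, b, sigma2=1.0):
    # spectral density (1.15) for alpha(x) = 1 + a1 x + ..., beta(x) = 1 + b1 x + ...
    z = np.exp(1j * omega)
    alpha = 1.0 + sum(aj * z**(j + 1) for j, aj in enumerate(a))
    beta = 1.0 + sum(bj * z**(j + 1) for j, bj in enumerate(b))
    return sigma2 / (2 * np.pi) * np.abs(beta)**2 / np.abs(alpha)**2

# numerical check of (1.17) for X_t = 1.8 X_{t-1} - 0.9 X_{t-2} + eps_t,
# i.e. a1 = -1.8, a2 = 0.9 in the convention X_t + a1 X_{t-1} + a2 X_{t-2} = eps_t
a1, a2 = -1.8, 0.9
omega = np.linspace(0, np.pi, 5)
lhs = np.abs(1 + a1 * np.exp(1j * omega) + a2 * np.exp(2j * omega))**2
rhs = 1 + a1**2 + a2**2 + 2 * a1 * (1 + a2) * np.cos(omega) + 2 * a2 * np.cos(2 * omega)
print(np.allclose(lhs, rhs))                                 # True
print(arma_spectrum(np.array([0.318]), a=[a1, a2], b=[]))    # close to the peak value in Figure 7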
Examples. 1. Let us begin with the MA(1) process

Xt = εt + 0.9εt−1.

By (1.16),

fX(ω) = (1/(2π))(1.81 + 1.8 cos(ω));

it is monotone decreasing, reaching its maximum at zero and its minimum at π (Figure 1). For the equation

Xt = εt − 0.9εt−1

it becomes

fX(ω) = (1/(2π))(1.81 − 1.8 cos(ω))

and the picture is the opposite (maximum at π, minimum at zero, Figure 2).
2. Let us consider now an AR(1) process given by the equation
Xt = 0.9Xt−1 + εt
143

Spectrum

15

10

Frequency
π/4 π/2 3π/4 π

Figure 3. Spectrum of a stationary series Xt = 0.9Xt−1 + εt

Spectrum

15

10

Frequency
π/4 π/2 3π/4 π

Figure 4. Spectrum of a stationary series Xt = −0.9Xt−1 + εt

By (1.12), the spectrum is given by the formula


1 1
fX (ω) =
2π 1.81 − 1.8 cos(ω)
(see Figure 3, note sharp peak at zero). If we consider the process
Xt = −0.9Xt−1 + εt
instead, the formula becomes
1 1
fX (ω) =
2π 1.81 + 1.8 cos(ω)
(Figure 4, sharp peak at π).
3. For AR(2) processes, there is a number of possibilities. The shape of the
spectrum may be similar to what we have seen for AR(1) processes (peak at zero
or peak at π), but it could be more complicated. For instance, consider a process
Xt = 0.1Xt−1 + 0.6Xt−2 + εt .
The equation (1.17) implies
|α(eiω )|2 = 1.37 − 0.08 cos ω − 1.2 cos(2ω),
and its graph has two peaks, a bigger one at zero and a smaller one at π, see Figure
5.
For the equation
Xt = −0.5Xt−1 + 0.4Xt−2 + εt ,
144

Spectrum

1.5

1.0

0.5

Frequency
π/4 π/2 3π/4 π

Figure 5. Spectrum of a stationary series Xt = 0.1Xt−1 +


0.6Xt−2 + εt .

Spectrum

10
5

1
0.50

0.10
Frequency
π/4 π/2 3π/4 π

Figure 6. Spectrum of a stationary series Xt = −0.5Xt−1 +


0.4Xt−2 + εt , logarithmic scale.

we have
|α(eiω )|2 = 1.41 + 0.6 cos ω − 0.8 cos(2ω),
which again results in two peaks, however this time the peak at zero is much smaller
than the peak at π and can only be seen in logarithmic scale (Figure 6).
Finally, consider a process
Xt = 1.8Xt−1 − 0.9Xt−2 + εt .
This time, (1.17) implies
|α(eiω )|2 = 5.05 − 6.84 cos ω + 1.8 cos(2ω).
and therefore
1 1
fX (ω) =
2π 5.05 − 6.84 cos ω + 1.8 cos(2ω)
The graph is shown on Figure 7). Recall that this AR(2) process has a clearly
visible quasi-periodic behavior (Figure II.21), which is reflected by the spectrum.
Comments and Explanation. Let us begin with MA(1) and AR(1) pro-
cesses. As above, assume σε2 = 1. Spectrum of MA(1) process
Xt = εt + bεt−1
is given by the formula
1
fX (ω) = (1 + 2b cos(ω) + b2 )

145

Spectrum

150

100

50

Frequency
π/10 π/4 π/2 3π/4 π

Figure 7. Spectrum of a stationary series Xt = 1.8Xt−1 −


0.9Xt−2 + εt . The maximum at the frequency 0.318 corresponds
to the period 19.8.

and therefore it is a scaled cosine function. However, cos(ω) is monotone decreasing


on the interval [0, π]. Therefore if b > 0, then the spectrum has the maximum at
zero and minimum at π; if b < 0, then the situation is opposite. The value at the
maximum depends on how large is |b| (if |b| = 1, then the maximum equals 1/π
and the minimum equals zero; however, this model is not invertible).
Let us now consider an AR(1) process
Xt = aXt−1 + εt
By (1.12),
1 1
fX (ω) = .
2π 1 + a2 − 2a cos(ω)
The expression 1 + a2 − 2a cos(ω) is similar to that for MA(1) model (b has been
replaced by −a), however it is now in the denominator. If a is positive, 1 + a2 −
2a cos(ω) has minimum at zero and maximum at π; for negative a it is the other
way around. Hence, the spectrum has a peak at zero for positive a, and it has a
peak at π if a is negative. However, there is one difference. The minimal value of
1 + a2 − 2a cos(ω) is equal to (1 − |a|)2 , so it can be made arbitrary small if a is
close to 1 or −1. Since it is in the denominator, the spectrum may have a very
sharp peak at zero if a is close to 1 (or at π if a is close to −1), as we have seen on
Figures 3 and 4.
Let us now consider an AR(2) process described by the equation
α(B)Xt = εt
2
where α(z) = 1 + a1 z + a2 z . The shape of the spectrum of an AR(2) process,
as well as its behavior, depends on the roots of the polynomial α. However, it is
more convenient to work with roots z1 , z2 of the polynomial z 2 + a1 z + a2 instead,
like we did in Section 2.4 1. Suppose first that the roots z1 , z2 are real. Then
z 2 + a1 z + a2 = (z − z1 )(z − z2 ) and
α(z) = (1 − z1 z)(1 − z2 z).
Then
|α(eiω )|2 = |(1 − z1 eiω )|2 |(1 − z2 eiω )|2

1 Roots of α are reciprocal to the roots of z 2 + a z + a , see Problem C1.9


1 2
146

and therefore the spectrum of the process equals, up to a factor 1/(2π), to the
product of the spectra of two AR(1) processes, one of them corresponds to the root
z1 and the other one corresponds to the root z2 . So, if both roots are positive, then
the spectrum has a peak at zero, and if both roots are negative, the spectrum has a
peak at π. If the roots are of the opposite sign, we have two peaks, one peak at zero
and the other one at π. Since the size of the peak depends on the absolute value of
the corresponding parameter, one of the peaks might look insignificant compared
to the other one. Say, for the a process
Xt = 0.1Xt−1 + 0.6Xt−2 + εt ,
we have z1 ≈ 0.826, z2 ≈ −0.726, so we have two roots of opposite sign. Since
|z1 | > |z2 |, the peak at zero has to be bigger than the peak at π (Figure 5).
For the equation
Xt = −0.5Xt−1 + 0.4Xt−2 + εt
we have z1 = 0.43, z2 = −0.93. Again, we have two roots of opposite signs. How-
ever, this time the negative root z2 is close to −1 and the positive root z1 is not
close to 1 at all. For that reason, the peak at zero is very small compared to that
at π (Figure 6).
In case of complex roots, AR(2) process has a quasi-periodic behavior and the
spectrum has a peak at a certain frequency. Namely, if z 2 + a1 z + z 2 has a pair of
complex roots z1,2 = reiθ , then the spectrum has a peak at a frequency ωmax that
is close to θ unless r is very small or θ is close to 0 or to π. Specifically, r and θ
should satisfy the condition
2r
(1.18) | cos(θ)| < .
1 + r2
Under this condition, ωmax could be found from the equation
1 + r2
(1.19) cos(ωmax ) = cos(θ).
2r
(see details below). The value of the spectrum at ωmax equals
1 1
(1.20)
2π (1 − r )2 sin2 (θ)
2

so we have very sharp peaks if r is close to 1.


For instance, for the process
Xt = 1.8Xt−1 − 0.9Xt−2 + εt ,
we have complex roots 0.9 ± 0.3i. Converting them to the exponential form, we get
re±iθ where r ≈ 0.949 and θ ≈ 0.322. By (1.19), the spectrum has a peak at the
frequency ωmax ≈ 0.318 (Figure 7).
Consider now a MA(2) process
Xt = εt + b1 εt−1 + b2 εt−2
with β(z) = 1+b1 z +b2 z . Since |β(eiω )|2 can be again found from (1.17), the shape
2

of the spectrum depends on the roots of the polynomial z 2 + b1 z + b2 . However,


|β|2 is now in the numerator and therefore peaks should be replaced by minimums
and the other way around. Also, |β|2 is just a linear combination of cos(ω) and
cos(2ω), so we can’t expect any sharp peaks or narrow troughs.
147

Finally, let us discuss AR(k) and ARMA(k, l) processes. Consider, for instance,
an AR(k) process
α(B)Xt = εt

with α(z) = 1 + a1 z + · · · + ak z k . The equation

z k + a1 z k−1 + · · · + ak = 0

has k roots z1 , z2 , . . . , zk , some of them may be complex. All of the roots should
satisfy the condition |zi | < 1. Since the roots of the polynomial α are reciprocal to
z1 , z2 , . . . , zk , the latter can be factorized as

α(z) = (1 − z1 z) . . . (1 − zk z)

Since complex roots come in pairs (a complex root and its conjugate), we can
therefore represent α(z) as a product

α(z) = α1 (z)α2 (z) . . . αl (z)

where each polynomial αi is either linear or quadratic, linear factors have the form
(1 − zj z) and correspond to real roots zj and quadratic factors have the form
(1 − zj z)(1 − zj z) and correspond to a pair of complex roots zj , zj . But then,

|α(eiω )|2 = |α1 (eiω )|2 |α2 (eiω )|2 . . . |α2 (eiω )|2

and therefore the spectrum of the AR(k) process equals, up to a 1/(2π)l−1 factor,
to the product of spectra of corresponding AR(1) and AR(2) processes. Therefore,
to every positive root there corresponds a peak at zero, to every negative root there
corresponds a peak at π and to every pair of complex roots there may correspond
a peak at corresponding frequency. Height and sharpness of peaks depend on
how big are the roots in absolute value. Roots with absolute value that is very
close to one, produce very sharp and tall peaks and the other peaks may look
insignificant. Similar arguments work for ARMA(k, l) processes. In fact, this way
we can approximate any positive continuous function.

Details. 1. Verification of (1.14). As a first step, we establish a relation


between autocovariance functions of X and Y . We have
X X
RY (k) = E(Yt Yt+k ) = E( gn Xt−n gm Xt+k−m )
n m
XX XX
= gn gm EXt−n Xt+k−m = gn gm RX (k + n − m)
n m n m

Therefore

1 X iωk X X
fY (ω) = e gn gm RX (k + n − m)
2π n m
k=−∞

1 XX X
= gn gm eiωk RX (k + n − m)
2π n m
k=−∞
148

Changing the variable l = k + n − m, k = l + m − n, we get



1 XX X
fY (ω) = gn gm eiω(l+m−n) RX (l)
2π n m
l=−∞

1 XX X
= gn gm eiωm e−iωn eiωl RX (l)
2π n m
l=−∞
XX
iωm −iωn
= gn gm e e fX (ω)
n m
= G(eiω )G(e−iω )fX (ω).
It remains to notice that
X X X
G(e−iω ) = gn e−iωn = gn eiωn = gn eiωn = G(eiω )
n n n

and
G(eiω )G(eiω ) = |G(eiω )|2 .
2. Verification of (1.19) and (1.20) (sketch). Let
Xt + a1 Xt−1 + a2 Xt−2 = εt
be an AR(2) process. Suppose the equation
z 2 + a1 z + a2 = 0
has a pair of complex roots z1,2 = re±iθ . We have r < 1 from the stationarity
condition. Also,
z 2 + a1 z + a2 = (z − z1 )(z − z2 )
which implies
(1.21) a1 = −2r cos(θ), a2 = r2 .
Assume that (1.18) is satisfied. Then ωmax exists and | cos(ωmax )| < 1. Now,
according to (1.17), the spectrum reaches its maximum when
(1.22) 2a1 (1 + a2 ) cos(ω) + 2a2 cos(2ω)
reaches its minimum. With help of (1.21), this becomes
(1.23) − 4r(1 + r2 ) cos(θ) cos(ω) + 2r2 cos(2ω).
Setting the derivative to zero, we get an equation
4r(1 + r2 ) cos(θ) sin(ω) − 4r2 sin(2ω) = 0
which is equivalent to the equation
sin(ω)((1 + r2 ) cos(θ) − 2r cos(ω)) = 0
So, we have three solutions to the equation, 0, π and the one is given by (1.19).
However, the first two solutions correspond to local maxima (second derivative
turns negative). Indeed, second derivative of (1.23) is equal to
4r(1 + r2 ) cos(θ) cos(ω) − 8r2 cos(2ω)
and its values at zero and at π are given by the formula
±4r(1 + r2 ) cos(θ) − 8r2 = 4r(±(1 + r2 ) cos(θ) − 2r) < 0
149

by (1.18). On the other hand, (1 + r2 ) cos(θ) = 2r cos(ωmax ) and therefore the


second derivative at ωmax equals
8r2 (cos2 (ωmax ) − cos(2ωmax )) = 8r2 (1 − cos2 (ωmax )) > 0.
To get (1.20), we have to use (1.15) and (1.17). Namely, we have to plug (1.21)
and (1.19) into (1.17) and do some messy simplifications.

Exercises
1. Verify (1.17).
For Problems 2-12, compute and graph the spectrum of the corresponding sta-
tionary processes (simplify your answer down to real numbers, things like eiωk are
not allowed; you may find (1.16) and (1.17) helpful). Assume σε2 = 1.
2. Xt = 0.7Xt−1 + εt
3. Xt = 0.9Xt−1 + εt + 0.9εt−1
4. Xt = 1.35Xt−1 − 0.66Xt−2 + εt
5. Xt = 0.3Xt−1 + 0.54Xt−2 + εt
6. Xt = −0.05Xt−1 + 0.9Xt−2 + εt
7. Xt = 1.6Xt−1 − 0.94Xt−2 + εt
8. Xt = 1.7Xt−1 − 0.8Xt−2 + εt − 1.9εt−1 + 0.95εt−2
9*. (1 − 1.4B + 0.75B 2 )(1 + 0.9B)Xt = εt
10*. (1 − 1.9B + 0.95B 2 )(1 + 1.7B + 0.9B 2 )Xt = εt
11**. (1 + 1.35B + 1.175B 2 + 0.6375B 3 )Xt = εt
12**. (1 + 0.028B + 1.06B 2 − 0.11B 3 + 0.21B 4 )Xt = (1 + 0.24B + 0.99B 2 )εt

2. Periodogram and its Properties


Now suppose that we have a stationary time series X1 , . . . , XN . Looking at the
definition of the spectral density, it is natural to try
N −1
1 X
IX (ω) = (R̂(0) + 2 R̂(k) cos(kω))
2π 1
(2.1) N −1
1 X
= ( R̂(k)eiωk )

k=−(N −1)

as an estimate for it. The function IX plays a central role in the spectral estimation
though it is not a good estimate by itself. It is called the periodogram of the time
series X.
To simplify our computations, we assume that the expectation of the series Xt
is known and is equal to zero (otherwise, we have to estimate it and subtract it
from the data). Also, we assume that the number of observations N is even (this
simplifies things a bit).
We begin with finding an alternative formula for the periodogram. It is conve-
nient to set Xt = 0 for all t 6= 1, . . . N and R̂(k) = 0 for all k such that |k| ≥ N .
Then

1 X
R̂(k) = Xt Xt+k
N t=−∞
150

Periodogram

1.2
1.0
0.8
0.6
0.4
0.2
Frequency
π/4 π/2 3π/4 π

Figure 8. Periodogram of a white noise, N = 128.

and we have

!
1 X
iωk
IX (ω) = R̂(k)e

k=−∞
∞ ∞
!
1 1 X
X
iωk
= Xt Xt+k e
2π N t=−∞
k=−∞
1 1 X X
(2.2) = Xt e−iωt Xt+k eiω(k+t)
2π N t
k
1 1 X −iωt
X
= Xt e Xs eiωs
2π N t s
N
1 1 X
= | Xt e−iωt |2
2π N t=1

(we did a substitution s = t + k, recalled that e−iωt is a complex conjugate to eiωt ,


and, finally, noticed that Xt = 0 if t is out of range).
The periodogram I(ω) can be computed for all ω. However, it is usually consid-
ered only for ωp = 2πp N where p = 0, 1, . . . , N/2 (remember, N is even). Frequencies
ωp are called principal frequencies. There are two reasons to do that. First, random
variables I(ω0 ), . . . , I(ωN/2 ) are independent (at least, asymptotically) and their
distribution can be found. Second, values of the periodogram for other frequencies
do not contain any additional information about the data.
Since R̂(k) are consistent and asymptotically unbiased estimates for the auto-
covariance function R(k), we may expect the periodogram to be an asymptotically
unbiased and consistent estimate for the spectral density. Unfortunately, the second
part is not true (it is not consistent). Indeed, look at the graphs of the periodogram
of the white noise (Figures 8, 9, 10 and 11, sample sizes are 128, 512, 2,048 and
8,192). Its spectral density is a constant, but the periodogram does not show any
signs of becoming a constant.
We begin with some trig identities. Let ωp = 2πp/N be one of the principal
frequencies. Then
N
X N
X
(2.3) cos(ωp t) = sin(ωp t) = 0
t=1 t=1
151

Periodogram

0.8

0.6

0.4

0.2

Frequency
π/4 π/2 3π/4 π

Figure 9. Periodogram of a white noise, N = 512.

(see Exercise 7 in Appendix C, Section 2).


With the help of (2.3) and trig identities for the products of trig functions, one
can show that
N
X
cos(ωp t) sin(ωq t) = 0
t=1
N
X
cos(ωp t) cos(ωq t) = 0 if p 6= q
t=1
N
X
sin(ωp t) sin(ωq t) = 0 if p 6= q
t=1
N
X
(2.4) cos2 (ωp t) = N/2 if p 6= 0, N/2
t=1
N
X
cos2 (ωp t) = N if p = 0 or N/2
t=1
N
X
sin2 (ωp t) = N/2 if p 6= 0, N/2
t=1
N
X
sin2 (ωp t) = 0 if p = 0 or N/2
t=1

(see Exercise 8 in Appendix C, Section 2).


Let ep , fp be vectors in a N -dimensional Euclidean space RN with coordinates

ep = (cos(ωp ), cos(2ωp ), . . . , cos(N ωp ))


fp = (sin(ωp ), sin(2ωp ), . . . , sin(N ωp ))

Note that f0 = fN/2 = 0 and the remaining N vectors f0 , . . . , fN/2 , e1 , . . . , eN/2−1


are orthogonal to each other because of (2.4). Hence they form an orthogonal
basis in RN and every vector in RN can be uniquely represented as their linear
combination. Moreover, the corresponding coefficients can be found in terms of
scalar products.
152

Periodogram

1.5

1.0

0.5

Frequency
π/4 π/2 3π/4 π

Figure 10. Periodogram of a white noise, N = 2048.

Periodogram
1.4
1.2
1.0
0.8
0.6
0.4
0.2
Frequency
π/4 π/2 3π/4 π

Figure 11. Periodogram of a white noise, N = 8192.

We are especially interested in the vector X = (X1 , . . . , XN ). We have


N/2 N/2−1
X X
(2.5) X= ap ep + bp fp
p=0 p=1

where the coefficients ap , bp are given by the formula


N
2 X
ap = Xt cos(ωp t) if p 6= 0, N/2
N t=1
N
2 X
bp = Xt sin(ωp t) if p 6= 0, N/2
N t=1
(2.6)
N N
1 X 1 X
a0 = Xt cos(ω0 t) = Xt
N t=1 N t=1
N N
1 X 1 X
aN/2 = Xt cos(ωN/2 t) = (−1)t Xt
N t=1 N t=1

In terms of a series Xt , (2.5) can be rewritten as follows;


N/2−1
X
Xt = a0 + [ap cos(ωp t) + bp sin(ωp t)] + aN/2 (−1)t
p=1
153

On the other hand, in terms of a periodogram,


N
1 1 X
IX (ωp ) = | Xt e−iωp t |2
2π N t=1

N
!2 N
!2 
1 1  X X
(2.7) = Xt cos(ωp t) + Xt sin(ωp t) 
2π N t=1 t=1
(
N 2 2
(ap + bp ) if p 6= 0, N/2
= 8πN 2
2π ap otherwise
Formula (2.7), along with the formulas for the coefficients ap , bp , allows us to
study a simple example.
Periodogram of a white noise. Suppose Xt = εt is a Gaussian white noise,
with zero expectation and variance σε2 . The coefficients ap , bp given by (2.6), are
linear combinations of independent Gaussian random variables and therefore they
have a multivariate Gaussian distribution. Since the expectation of Xt is equal to
zero, the expectations of ap and bp are also equal to zero. Let p 6= 0, N/2. Since
Xt are independent, we have
N
4 X 2
E(a2p ) = Var ap = 2
cos2 (ωp t)σε2 = σε2
N t=1 N
N
4 X 2 2
E(b2p ) = Var bp = 2 sin (ωp t)σε2 = σε2
N t=1 N
N N
4 XX
E(ap bp ) = Cov(ap , bp ) = cos(ωp t) sin(ωp s) Cov(Xt , Xs )
N 2 t=1 s=1
N
4 X
= cos(ωp t) sin(ωp t)σε2 = 0
N 2 t=1
In a similar way, we can check that Cov(ap , aq ) = Cov(ap , bq ) = Cov(bp , bq ) = 0 if
p 6= q.
In particular, ap and bp are independent Gaussian random variables with zero
mean and variance N2 σε2 . Therefore
N 2
(a + b2p )
2σε2 p
has a χ2 distribution with two degrees of freedom. In a similar way, we can find
that
N 2 N 2
a0 , and a
σε2 σε2 N/2
have χ2 distribution with one degree of freedom.
Now, if we take into account the formula (2.7) and the fact that the spectral
1 2
density of the white noise is a constant 2π σε , we get the following result (for a
white noise).
Proposition 1. Suppose Xt is a white noise. Random variables
I(ωp )
ζp = , p = 0, . . . , N/2
fX (ωp )
154

are independent. If p is not zero or N/2, then ζp has 12 χ2 (2) distribution. Other-
wise, it has a χ2 (1) distribution.
In particular, we see from here that the periodogram is an unbiased estimate for
the spectral density (at least for the principal frequencies). However, the variance
of the periodogram does not depend on N and does not go to zero as N → ∞.
Therefore the periodogram is not a consistent estimate.
White noise test. The established result allows us to construct a white noise
test. Consider

I(ωp )
W = sup 2
p=1,...,N/2−1 σ̂X /2π

If the series is a white noise, then

N/2−1
Y
P {W ≤ z} = P {ζp ≤ z} ≈ (1 − e−z )N/2−1
p=1

(because 12 χ2 (2) is actually an exponential distribution with parameter 1; it is


an approximate statement because we have replaced the actual variance by its
estimate). Therefore, it remains to choose z in such a way to have the left side to
be equal to, say, 95 % (or 1 − α) if we want to use α as a level of significance). If
W exceeds that level, we reject the hypothesis, otherwise we accept it.

In order to investigate a more general situation, we establish the following


approximation.
Suppose two series Xt and Yt are related to each other by the equation

m
X
Yt = gk Xt−k
k=0

where m is small as compared to the sample size N . Then

(2.8) IY (ω) ≈ |G(eiω )|2 IX (ω)

gk z k (compare with (1.14)). Indeed, note that


P
where G(z) = k

N N
1 X 1 X
√ Xt eiωt ≈ √ Xt−k eiω(t−k)
N t=1 N t=1

if k is small (the sums actually differ by a few terms). Therefore,

N N
1 X X
Xt−k eiω(t−k) Xt−l e−iω(t−l) ≈ IX (ω)
2πN t=1 t=1
155

if both k and l are small. With this in mind, we have


1 X
IY (ω) = | Yt eiωt |2
2πN t
1 X X
= Yt eiωt Ys e−iωs
2πN t s
1 XX XX
= gk e Xt−k eiω(t−k)
iωk
gl e−iωl Xt−l e−iω(t−l)
2πN t s
k l
" #
XX
iωk −iωl 1 X iω(t−k)
X
−iω(t−l)
= gk e gl e Xt−k e Xt−l e
2πN t s
k l
XX
≈ gk eiωk gl e−iωl IX (ω)
k l
= G(eiω )G(e−iω )IX (ω) = |G(eiω )|2 IX (ω).
From (2.8), (1.14) and the Proposition 1, we get the following
Proposition 2. Suppose Xt is a Gaussian ARMA(k, l) process. Random
variables
IX (ωp )
ζp = , p = 0, . . . , N/2
fX (ωp )
are asymptotically independent. If p is not zero or N/2, then ζp has (asymptotically)
1 2 2
2 χ (2) distribution. Otherwise, it has (asymptotically) a χ (1) distribution.
In fact, this statement remains valid for any stationary series which has a
continuous spectral density, but we will not justify it.
Periodogram and discrete Fourier transform. Let X1 , . . . , XN be a finite
sequence of real numbers. Its discrete Fourier transform is defined by the formula
N
X
(2.9) FX (k) = Xn · e−i2πkn/N
n=1

The new (complex-valued) sequence is periodic with period N . Indeed,


N
X N
X
FX (k + N ) = Xn · e−i2π(k+N )n/N = Xn · e−i2πkn/N e−i2πn
n=1 n=1
−i2πn
and e = cos(2πn) − i sin(2πn) = 1 for integer n. The original sequence
Xn , n = 1, . . . , N could be restored by the formula
N
1 X
(2.10) Xn = FX (k) · ei2πkn/N
N
k=1

(so called inverse discrete Fourier transform). Also, since Xn is real-valued, FX (−k)
coincides with complex conjugate of FX (−k).
There exist efficient algorithms that allow us to compute FX (so called fast
Fourier transform).
In terms of discrete Fourier transform, the periodogram of a time series Xt , t =
1, . . . , N could be written as
2πp 1 1
IX ( )= |FX (p)|2 , p = 0, 1, . . . , N/2
N 2π N
156

Periodogram

3000
2500
2000
1500
1000
500
Periods
80 11 5.5 3 2

Figure 12. Periodogram of the sunspots data.

Periodogram

1000

100

10

0.1

Periods
80 11 5.5 3 2

Figure 13. Periodogram of the sunspots data in the logarithmic scale.

Spectrum

10
1
0.100
0.010
0.001

Frequency
π/4 π/2 3π/4 π

Figure 14. Periodogram and the spectrum of the series Xt =


0.9Xt−1 + εt in log scale.

where FX is the discrete Fourier transform of the sequence Xt . In fact, real and
imaginary parts of FX (p) are proportional to the coefficients ap and bp given by
(2.6).
Examples and Further Comments. 1. On the following Figures 12 - 14,
you can see a periodogram of the sunspots data, and a periodogram of a series
Xt = 0.9Xt−1 + εt , together with its spectrum.
2. Sometimes, the periodogram reveals a presence of periodic components or
even the structure of the data. On Figure 15, you can see a mysterious data set.
Looking at its ACF and PACF, we may try different ARMA models (AR(4)? or
157

Data

Time
50 100 150 200 250 300 350
-2

-4

Figure 15. Mysterious Harmonics data.

ACF
1.0

0.5

Lags
1 5 10 15 20 25 30

-0.5

-1.0

Figure 16. ACF of the Harmonics data.

PACF
1.0

0.5

Lags
1 5 10 15 20 25 30

-0.5

-1.0

Figure 17. PACF of the Harmonics data.

maybe MA(3)? or something mixed?) but they don’t really work. However, the
periodogram reveals the secret: It is a mixture of six periodic components (yes, just
that with no noise).
3. Another example of this type was given in the introduction (human brain
example, Figures 7 and 8 in the Introduction). Though we can’t see them in the
noise and all other brain activity, the series contains two periodic components (one
is the main electric power frequency and the other one is a resonance).

Exercises
158

Periodogram

15

10

Periods
30 10 5 4 3 2

Figure 18. Periodogram of the Harmonics data.

1. For the following data set (will be posted on the web, 30 data points) com-
pute and graph the periodogram. Reminder: the periodogram should be computed
only for the principal frequencies. Suggestion: you may find the alternative formula
(2.2) to be the easiest.
2
2. For a series of length 400, estimated variance σ̂X = 3.96 and the maximum
of the periodogram equals 4.91. Is a series a white noise? What if the maximal
value of the periodogram is 6.35? Use 5 % level of confidence.

3. Estimation of a spectrum. Properties of the estimates


There exist two principal ideas that could be used to construct a consistent
estimate of the spectral density.
1. As we have seen in Section 2, the periodogram IX (ω) turned out to be a
highly oscillating function. However,
N −1
1 X
IX (ω) = (R̂(0) + 2 R̂(k) cos(kω))
2π 1

is a linear combination of cosine functions. The bigger is k, the smaller is the period
of the corresponding component. Hence, IX oscillates heavily because too much
weight has been given to the highly oscillating components, the ones with large ks.
For this reason, we may consider an estimate of the following structure
N −1
1 X
(3.1) fˆ(ω) = (R̂(0) + 2 λk R̂(k) cos(kω))
2π 1

where the coefficients λk , called the lag window, should somehow decrease as k
grows. Typically, we look for the coefficients of the following structure:
(
k
λ( M ), 0≤k≤M
(3.2) λk =
0 k>M
The function λ(·) is called a window generator, and M is the truncation point (it is
often called the window width). The function λ(x) should be decreasing, continuous
at zero, such that λ(0) = 1, λ(1) = 0. Note that λk → 1 as M → ∞. So, it looks
like the parameter M should go to infinity as N → ∞, otherwise we can’t achieve
the consistency.
159

2. We could mollify the periodogram instead. To this end, define a convolution


of the functions f and g , denoted f ? g, by the formula
Z ∞
f ? g(x) = f (y)g(x − y) dy
−∞
Suppose g(x) = 0 outside of the interval (−C, C) and g(x) = 1/2C otherwise. Than
R ∞interval (x − C, x + C). If the
f ? g(x) is the average value of the function f on the
function g is not a constant (but still the integral −∞ g(x) dx = 1), then f ? g(x)
could be considered as a weighted average of the values of f .
For that reason, a standard way to mollify a function f is to take its convolution
with another,
R smooth function g which is supposed to be even and satisfy the
condition g = 1. It could be shown that, if M → ∞ and gM (x) = M g(M x), then
f ? gM converges to f in a certain sense. (If the function g vanishes outside [−1, 1],
then gM vanishes outside [−1/M, 1/M ], so we are talking about weighted average
over small intervals.)
With this in mind, we may consider an estimate of the form
Z ∞
(3.3) fˆX (ω) = w(ω − θ)IX (θ) dθ = IX ? w(ω)
−∞
Here IX (ω) is treated as an even function with period 2π and an even function
w(ω) is the so-called spectral window satisfying
Z ∞
w(ω)dω = 1
−∞
Again, often (but not always) we look for a spectral window of the form
(3.4) w(ω) = wM (ω) = M W (M ω)
where M plays the role similar to that in case of lag windows, and W (ω) is the
spectral
R window generator, typically bell-shaped with compact support and with
W (ω)dω = 1. When M grows, the function w(ω) becomes more and more con-
centrated around zero.
Assuming M is large and the function W vanishes outside of some neighborhood
of zero, we can replace limits of integration by −π and π (we have some trouble
near the end points).
In a certain sense, those two approaches are nearly equivalent. Indeed, suppose
that the estimate is given by (3.1) with lag window λk . Then
Z ∞
(3.5) ˆ
fX (ω) = w(ω − θ)IX (θ) dθ = IX ? w(ω)
−∞
where
N −1
1 X
(3.6) w(z) = λk e−ikz

k=−(N −1)

which is similar to (3.3), with no agreement about extending the integration to the
whole real line (so we have sort of end-effects near −π and π).
The other way around, suppose that fˆ is defined by (3.3) with the spectral
window w(ω). Then
N −1
1 X
(3.7) fˆ(ω) = (R̂(0) + 2 λk cos(kω)R̂(k))

k=1
160

where
Z ∞
(3.8) λk = w(z) cos(kz) dz.
−∞
(see details below).
In fact, if the spectral window w is generated by the window generator W (ω),
then the corresponding lag window is generated by the lag window generator
Z ∞
λ(x) = W (z)eixz dz
−∞
with the same parameter M . Conversely, the inverse formula for the Fourier trans-
form implies that Z ∞
1
W (z) = λ(x)e−ixz dx.
2π −∞
However, we can’t claim that the lag window generator vanishes outside of the
interval [−1, 1]; the other way around, if the lag window generator vanishes outside
[−1, 1], then the corresponding spectral window generator does not have compact
support. For this reason, these two approaches are complimentary to each other.
Since the spectral density is always non-negative, it is natural to require that its
estimate is also non-negative. This explains a certain advantage of the spectral win-
dows: If w(z) is non-negative, then the estimate fˆ(ω) automatically non-negative
(since the periodogram is non-negative, we are integrating a non-negative function
in (3.3)). A condition on the lag window which guarantees non-negativity of the
estimate, is more tricky and not natural (essentially, it says that the corresponding
spectral window is non-negative).
Many dozens of windows have been discussed. A few examples.
1. Truncated periodogram. It is defined by the lag window
(
1 if |k| ≤ M
λk =
0 otherwise
Corresponding spectral window is equal to
1 sin((M + 1/2)θ)
w(θ) =
2π sin(θ/2)
(it is called the Dirichlet kernel. It is not non-negative, so the corresponding esti-
mate may be negative somewhere).
2. Bartlett, or triangular, window. Again, a lag window given by the
formula
(
1 − |k|
M if |k| ≤ M
(3.9) λk =
0 otherwise
It corresponds to a non-negative spectral window
 2
1 sin(M θ/2)
(3.10) w(θ) =
2πM sin(θ/2)
(so called Fejer kernel of the order M ).
3. Rectangular, or Daniell, window. This is a spectral window of the form
(
M π
if |θ| ≤ M
w(θ) = 2π
0 otherwise
161

It is equivalent to the lag window


sin(πk/M )
λk =
πk/M
Note that there is no truncation point here, the window does not vanish beyond
M.
4. Tukey window. It is a lag window of the form
(
1 − 2a + 2a cos(πk/M ) if |k| ≤ M
λk =
0 otherwise

where 0 < a ≤ 1 is a parameter, typically set to 0.25 (so called Tukey-Hanning win-
dow). The corresponding spectral window is a linear combination of the Dirichlet
kernels (see example 1) and it is not non-negative.
5. Parzen window. Again, a lag window with the window generator given
by the formula

2
1 − 6x + 6|x| ,
 3
|x| ≤ 12
1
λ(x) = 2(1 − |x|)3 , 2 ≤ |x| ≤ 1

0 |x| > 1

The weights λk could be found from (3.2). The function λ(x) is twice differentiable
(as we’ll see, this is an advantage) and the corresponding spectral window is non-
negative.
6. Bartlett-Priestley window. This is a spectral window with the window
generator
( h i
3 θ2
1 − 2 if |θ| ≤ π
W (θ) = 4π π
0 otherwise
(remember, it has to be re-scaled with the window width M ).
How can we compare the windows, that is, how can we compare the corre-
sponding estimates? Also, how should we choose the window width M ? In order
to be able to say something about it, we have to study the bias and the variance
of the estimates.
Since the lag window and the spectral window approaches are (nearly) equiva-
lent, we consider an estimate (3.3) with the spectral window generator W (ω). We
assume that it is even and non-negative and it satisfies the condition
Z
W (θ) dθ = 1

In addition, we assume that


Z
W 2 (θ) dθ < ∞

Also, we will assume that the spectral density of the process fX (ω) is twice con-
tinuously differentiable. If this is the case, one can show that the bias b(ω) of the
estimate is approximately equal to
f 00 (ω) π 2
Z
(3.11) b(ω) = E fˆ(ω) − f (ω) ≈ X 2 θ W (θ) dθ
2M −π
162

AR4 Data

50

0 Time
600 650 700 750

-50

Figure 19. Simulated AR(4) process Xt − 3.3Xt−1 + 4.55Xt−2 −


3.06Xt−3 + 0.855Xt−4 = εt .

The variance of the estimate, v 2 (ω), could be found from the formula

M π
Z
(3.12) 2 ˆ 2
v (ω) = Var f (ω) ≈ 2πfX (ω) W 2 (θ) dθ
N −π

Both formulas hold asymptotically as N → ∞, M → ∞ and M 2 /N → 0.


Derivation of (3.11) and (3.12) is very involved. Some sketch could be found
at the end of the section.
Discussion. 1. Consistency conditions and the role of the window
width. We can conclude from (3.11) and (3.12) that the estimate is consistent if
M → ∞ but, at the same time, M 2 /N → 0 as N → ∞. This, still, gives us a lot
of flexibility (say, M = N α with 0 < α < 1/2 works perfectly). Later we will find
out that M = const. N 1/5 is, in a certain sense, optimal.
On the other hand, (3.11) and (3.12) imply that, the bigger is M , the smaller is
the bias and, at the same time, the bigger is the variance. If M is way too small, the
estimate is very smooth even if the actual spectrum contains sharp peaks. If M is
too large, the estimate behaves similar to a periodogram. On practice, choice of M
is a subjective thing. To illustrate that, we use a simulated AR(4) process (Figure
19) described by the equation Xt −3.3Xt−1 +4.55Xt−2 −3.06Xt−3 +0.855Xt−4 = εt .
Its spectral density contains two sharp peaks that correspond to periods 20 and 9.05.
On Figures 21 - 23, you can compare the actual spectrum and its estimates (Parzen
window) constructed with different M (N = 2000). If we use Bartlett window
instead, the picture would be different.
2. Confidence intervals. In order to be able to construct confidence
intervals for the spectrum, we need to know more about the distribution of the
estimate. It is definitely far from normal, so standard tricks with ±2σ won’t work.
Since random variables I(ωp )/f (ωp ) are independent and have the χ2 distribution,
it looks like the χ2 distribution could be a reasonable guess. (Recall that a sum
of independent χ2 random variables is again a χ2 distributed random variable).
However, χ2 distribution has a parameter (degrees of freedom). Comparing the
expectation and variance of the estimate with those of the χ2 distribution, we may
suggest that
fˆX (ω) 1
≈ χ2 (ν)
fX (ω) ν
163

Periodogram
6000
5000
4000
3000
2000
1000
0 Periods
20 9.05 5 3 2

Figure 20. Periodogram of the AR(4) process Xt − 3.3Xt−1 +


4.55Xt−2 − 3.06Xt−3 + 0.855Xt−4 = εt .

Spectrum
1400
1200
1000
800
600
400
200
Periods
20 9.05 5 3 2

Figure 21. Spectrum of the AR(4) process Xt − 3.3Xt−1 +


4.55Xt−2 − 3.06Xt−3 + 0.855Xt−4 = εt and its estimate with
M = 25, Parzen window. Obviously, M is way too small, sec-
ond peak just disappeared.

Spectrum
1400
1200
1000
800
600
400
200
Periods
20 9.05 5 3 2

Figure 22. The same thing with M = 100. Now we can see both
peaks (though the peak on the right is not as tall as it should be.

where
N
ν= R .
πM W 2 (ω)dω
164

Spectrum
1400
1200
1000
800
600
400
200
Periods
20 9.05 5 3 2

Figure 23. The same thing with M = 250. Peaks are perfect,
but the estimate is no longer a smooth function, so M is probably
a bit too large.

Spectrum

1000

100

10

0.1
Periods
80 11 5.5 3 2

Figure 24. Confidence interval for a spectrum of the sunspots


data, together with its periodogram.

The value for ν comes from the equation


fˆX (ω)
Z
1 2 2 M
Var χ (ν) = = 2π W 2 (θ) dθ = Var
ν ν N fX (ω)
The number ν is called the equivalent degrees of freedom. With this in mind, we
can find the percentiles for the χ2 (ν) distribution, say aν and bν such that
ν fˆX (ω)
P {aν ≤ ≤ bν } = .95
fX (ω)
(or .99 or any other). Solving for f , we see that
ν fˆX (ω) ν fˆX (ω)
≤ fX (ω) ≤
bν aν
with probability .95, at least approximately. On Figure 24, we can see a confidence
interval for a spectrum of the sunspots series.
However, the constructed interval will not work if the bias is substantial. To
illustrate that, we use a simulated AR(2) process Xt − 1.4Xt−1 + 0.98Xt−2 = εt . Its
spectrum contains a very narrow peak, and if the parameter M is not sufficiently
large, the bias is substantial and the confidence interval is way off the
R π mark.
3. Comparison of the windows. We have two coefficients −π θ2 W (θ) dθ

and −π W 2 (θ) dθ which depend only on the spectral window and, in a sense,
165

Spectrum
1000
100
10
1
0.1
Periods
30 10 5 4 3 2

Figure 25. Spectrum of the process Xt −1.4Xt−1 +0.98Xt−2 = εt


and a confidence interval for it, constructed with inappropriate
window width (M = 25, sample size 2000, Parzen window).

Spectrum
1000
100
10
1
0.1

Periods
30 10 5 4 3 2

Figure 26. The same thing with M = 100 looks much better.

describe its properties. However, the smaller is theR first of them, the bigger is the
second and the other way around. Indeed, since W (θ) dθ = 1, the first of the
integrals is small if W is concentrated near zero. But then, W must be large in
a neighborhood of zero and the integral of W 2 must be also large. The following
table confirms this conclusion.
window bias × M 2 /f 00 (ω) variance × N/[M f 2 (ω)]
π2
Daniell 6 ≈ 1.645 1
Parzen 6 0.539285
π2
Tukey 4 ≈ 2.467 3/4
π2
Bartlett-Priestley 10 ≈ 0.987 6/5
4. Bandwidth. A typical goal of the spectral analysis is an interpretation
and/or explanation of the peaks/troughs in the spectrum. Hence, the goal of spec-
tral estimation is to estimate the shape of the spectrum rather than estimate a
value for any particular frequency.
As we see from (3.11), the bias of the estimate is proportional to the second
derivative of the spectral density. If the spectrum is nearly flat, the bias is small,
if it has sharp peaks or narrow troughs, then the bias is large at those points.
Suppose the spectral density has a peak at the frequency ω0 . Consider the
points ω1 < ω0 < ω2 such that f (ω1 ) = f (ω2 ) = 21 f (ω0 ) (so called half-power
points). The distance Bh (ω0 ) = ω2 − ω1 is called the spectral bandwidth of the
166

The Peak

800

600

400

200

Frequency
ω1 ω2 π/2 π

Figure 27. Half-power points for a peak

The Trough
1000

800

600

400

200

Frequency
ω1 ω2 π/2 π

Figure 28. One-and-a-half-power points for a trough

peak (see Figure 27). Replacing the density by a parabola


1
f (ω) ≈ f (ω0 ) + (ω − ω0 )2 f 00 (ω0 )
2
we can find
f (ω0 ) 1/2

ω1,2 ≈ ω0 ± 00
f (ω0 )
and therefore
f (ω0 ) 1/2

(3.13) Bh (ω0 ) ≈ 2 00

f (ω0 )
In a similar way, we can define a bandwidth of a spectral trough (we use one-and-
a-half-power points instead of the half-power points; see Figure 28). However, a
spectrum may contain several peaks and/or troughs. To reflect that, we define an
overall spectral bandwidth as a minimum,
f (ω) 1/2

(3.14) Bh = 2 inf 00

ω f (ω)

What happens to the spectral density when we estimate it by using any partic-
ular window? We begin with the example. Let w(ω) be the Daniel, or rectangular,
window w(ω) = 2πM π
on (− M ,Mπ
). The estimate fˆ(ω) is then the average of the
periodogram over the interval of the width 2π
M . Assuming that the expectation of
the periodogram is approximately equal to the spectral density, and ignoring the
167

random component (the moving average is supposed to reduce it significantly), we


see that two peaks have a chance to glue together (or the trough may disappear)
if the window covers both of them. None of this may happen if the width of the
window is smaller than the bandwidth, say if 2π 1
M ≤ 2 Bh . Hence, if we expect a

spectrum to have a bandwidth Bh , we should have M ≥ B h
. It is natural to call

M the bandwidth of the window.
However, if the window is not rectangular, then it is not clear how to define
its width. After a long discussion, the following definition was suggested. Let
w(ω) = M W (M ω) be a spectral window related to the window generator W (ω).
Its window width is defined as BW /M where
Z ∞ 1/2

(3.15) BW = 12 ω 2 W (ω) dω
−∞

depends on the window generator. The constant 12 has been chosen to be con-
sistent with the width of the Daniell window.
5. Bandwidth vs autocovariance function. Differentiating the series

1 X
f (ω) = (R(0) + 2 R(k) cos(kω)),

k=1

we get

2 X 2
f 00 (ω) = − k R(k) cos(kω),

k=1
So, if R(k) decays rapidly, f 00 is relatively small and the bandwidth is large. Oth-
erwise, bandwidth is small and the spectrum contains sharp peaks and/or narrow
troughs.
Details. 1. Derivation of (3.7). Indeed, (1.4) implies that
Z π
R̂(k) = IX (θ) cos(kθ)dθ
−π
Z π
= IX (θ)eiθk dθ
−π

Therefore
N −1
1 X
fˆ(ω) = (R̂(0) + 2 λk R̂(k) cos(kω)
2π 1
N −1
1 X
= λk R̂(k)e−iωk

k=−(N −1)
N −1 Z π
1 X
= λk e−iωk IX (θ)eiθk dθ
2π −π
k=−(N −1)
 
Z π N −1
1 X
=  λk e−i(ω−θ)k  IX (θ)dθ
−π 2π k=−(N −1)
Z π
= w(ω − θ)IX (θ)dθ
−π
168

where w is given by (3.6).


R ∞ 2. Derivation of (3.5). We assume now that w(ω) is an even function with
−∞
w(ω) dω = 1. We have
Z ∞
fˆ(ω) = w(ω − θ)IX (θ) dθ
−∞
Z ∞ N −1
1 X
= w(ω − θ) ( R̂(k)eikθ ) dθ
−∞ 2π
k=−(N −1)
N −1 Z ∞ 
1 X
ik(θ−ω)
= w(ω − θ)e dθ eiωk R̂(k)
2π −∞
k=−(N −1)
N −1 Z ∞ 
1 X
ikz
= w(z)e dz eiωk R̂(k)
2π −∞
k=−(N −1)
N −1
1 X
= λk eiωk R̂(k)

k=−(N −1)

where Z ∞
λk = w(z)eikz dz.
−∞
Since w is even, Z ∞
w(z) sin(kz) dz = 0
−∞
and therefore Z ∞
λk = λ−k = w(z) cos(kz) dz.
−∞
Also, Z ∞
λ0 = w(z) dz = 1
−∞

Since R̂(k) is also even, we get


N −1
1 X
fˆ(ω) = λk eiωk R̂(k)

k=−(N −1)
N −1
1 X
= (R̂(0) + (λk eiωk R̂(k) + λ−k e−iωk R̂(−k))

k=1
N −1
1 X
= (R̂(0) + 2 λk cos(kω)R̂(k)).

k=1

as promised.
3. Derivation of (3.11) and (3.12) (sketch). To simplify the presentation,
we assume that the expectation of the series is equal to zero and this is known to
us. We begin with the expectation of the periodogram IX (ω). By (III.2.4), we have
Z π
|k| |k|
E R̂(k) = (1 − )R(k) = (1 − ) eikθ fX (θ) dθ
N N −π
169

and therefore
 
(N −1)
1 X
EIX (ω) = E  R̂(k)e−ikω 

k=−(N −1)
(N −1)
1 X
= E R̂(k)e−ikω

k=−(N −1)

π (N −1)
|k|
Z
1 X
= e−ik(ω−θ) (1 − )fX (θ) dθ
2π −π k=−(N −1) N

Now, the sum


(N −1)
1 X |k|
e−ikϕ (1 − )
2π N
k=−(N −1)

can be evaluated. It is equal to the so called Fejer kernel


1 sin2 (N ϕ/2)
FN (ϕ) = .
N sin2 (ϕ/2)
Hence,
Z π
EIX (ω) = FN (ω − θ)fX (θ) dθ = FN ? fX (ω)
−π
is the convolution of the spectral density with the Fejer kernel FN . It could be
shown that EI(ω) → fX (ω) as N → ∞, so the periodogram is an asymptotically
unbiased estimator for the spectrum. However, we are interested in an expectation
of an estimate (3.3).
We have
Z ∞
ˆ
E f (ω) = E(IX (θ))wN (ω − θ) dθ
−∞
Z ∞Z π
= fX (ϕ)FN (θ − ϕ)wN (ω − θ) dϕ dθ
−∞ −π
Z π
= fX (ϕ)FN ? wN (ω − ϕ) dϕ
−π
Z π
= fX (ω − ϕ)FN ? wN (ϕ) dϕ
−π

Now,
0 ϕ2 00
fX (ω − ϕ) ≈ fX (ω) − ϕfX (ω) + f (ω)
2 X
Hence the bias
Z π
b(ω) = E fˆX (ω) − fX (ω) = fX (ω − ϕ)FN ? wN (ϕ) dϕ − fX (ω)
−π
Z π
= (fX (ω − ϕ) − fX (ω))FN ? wN (ϕ) dϕ
−π
π
ϕ2 00
Z
0
≈ (−ϕfX (ω) + f (ω))FN ? wN (ϕ) dϕ
−π 2 X
170

Finally, it could be seen that


Z π
ϕFN ? wN (ϕ) dϕ = 0
−π

(since both kernels FN and wN are even). Also, it could be shown that
Z π Z π
1
ϕ2 FN ? wN (ϕ) dϕ ≈ 2 ϕ2 W (ϕ) dϕ
−π M −π

(asymptotically as M → ∞ and M 2 /N → 0). This way, we get (3.11).


Along the same lines, we can establish (3.12) for the variance of the estimate.
We begin with
v 2 (ω) = Var fˆ(ω) = Cov(fˆX (ω), fˆX (ω))
Z π Z π
= Cov(IX (θ1 ), IX (θ2 ))WN (ω − θ1 )WN (ω − θ2 ) dθ2 dθ1
−π −π

which can be evaluated if we do know the covariance of IX (θ1 ) and IX (θ2 ). Unfor-
tunately, corresponding derivations take several pages.

Exercises
1. For the following data set (will be posted on the web, 100 data points),
estimate and graph the spectral density (a) using the truncated periodogram with
M = 10; (b) triangular window with M = 10; (c) Parzen window with M = 10.
2. Repeat the same with M = 20. Which of the estimates (out of both
problems) looks more reasonable to you?

4. Estimation Details
Precision of the estimate and comparison of the windows. One of
the possible measures of the precision of the estimate is called the mean square
percentage error. It is defined by the formula
h i
(4.1) η 2 (ω) = E{fˆ(ω) − f (ω)}2 /f 2 (ω) = {v 2 (ω) + b2 (ω)}/f 2 (ω)
Denote Z ∞
IW = 2π W 2 (ω) dω
−∞
and
∞ 1/2

Z
BW = 12 ω 2 W (ω) dω
−∞
According to (3.11) and (3.12),
M
v 2 (ω) = fX
2
(ω) IW
N
and
f 00 (ω) BW
2
b(ω) =
2M 2 12
Hence
M 1 BW4
(f 00 (ω))2 M BW 4
η 2 (ω) = IW + = I W +
N 4M 4 144f 2 (ω) N 576Bh (ω)4 M 4
171

where Bh (ω) is the spectral bandwidth defined by (3.13). Since the overall spectral
bandwidth Bh is the minimum of Bh (ω), we have
4
M BW
(4.2) max η 2 (ω) = IW +
ω N 576Bh4 M 4
Since M is at our disposal, let’s find the minimum of the right side in M . Differ-
entiating with respect to M , we get
 4
IW 1 BW
= M −5 = 0
N 144 Bh
which yields
 4/5  1/5
BW N −1/5
(4.3) M= IW
Bh 144
and, finally,
 4/5
BW IW
(4.4) max η 2 (ω) ≈ 0.463N −4/5
ω Bh
The formula (4.4) gives an upper bound for the precision of the estimate. It depends
on the sample size N , on the spectral bandwidth Bh and on the spectral window.
The value BW IW characterizes the efficiency of the window (the smaller it is, the
better precision can be achieved for the same sample size N and the same spectral
bandwidth). Comparing the efficiency of the different windows, we get the following
table.
window Bandwidth variance × N/[M f 2 (ω)] efficiency
Daniell 2π 1 6.2832
Parzen 12 0.539285 6.48
Tukey-Hanning 2.45π 3/4 5.7715
Bartlett-Priestley 1.55π 6/5 5.8403
However, the Tukey-Hanning window is not non-negative (it may lead to negative
estimates of the spectral density, as we can see on Figure 32). If we limit ourselves
by the non-negative windows, we see that the Bartlett-Priestley window has the
best efficiency. In fact, it could be shown theoretically that the Bartlett-Priestley
window is most efficient in a class of non-negative spectral windows.
Solving for the sample size N , we see that
BW IW
N ≥ 0.463η −5/2
Bh
which gives us a lower bound for the sample size if we wish to achieve given precision
and are willing to handle spectral densities with the spectral bandwidth Bh or more.
Lag windows and leakage. At the early days, lag windows (truncated pe-
riodogram, triangular, Parzen etc.) were heavily used. Later, frequency windows
were suggested. Lag windows are easier to use, we need only the values of the
autocovariance function up to the lag M . In order to use the frequency window,
a periodogram is needed. As we have seen, lag windows can be transformed into
frequency windows and vice versa. However, there is an important difference. If the
lag window vanishes after some lag M , then the corresponding frequency window
does not have any truncating point, and vice versa. For instance, let’s consider
a triangular window (3.9). Corresponding frequency window (3.10) is given by a
172

80

60

40

20

-0.10 -0.05 0.05 0.10

Figure 29. Fejer’s kernel that corresponds to the triangular lag window.

Periodogram
8000

6000

4000

2000

Periods
10 9 8 7 6

Figure 30. If the spectrum (and the periodogram) contains nar-


row peaks, a situation like shown on the picture may occur. The
peak in the periodogram is now entirely within second wave, and
a small peak in the estimate will appear.

Fejer’s kernel (Figure 29). In addition to the primary peak at zero, Fejer’s window
has small secondary peaks.
The value of the estimate at a frequency ω is therefore a weighted average of
the values of the periodogram around ω with weights given by the Fejer’s kernel.
Suppose the spectrum (hence the periodogram) contains a narrow peak at the
frequency ω0 . As we move away from ω0 , the peak in the periodogram will move
from the main peak of the Fejer’s kernel to a secondary one (Figure 30), producing
a small but visible peak which does not exist in the original spectrum. Such an
effect is called a leakage. For a triangular window, the leakage is significant (Figure
31), but other lag windows (Parzen, Tukey) also have this effect, as you can see on
Figure 25 for Parzen window and on Figure 32 for Tukey window.
Differencing and pre-whitening. Suppose the actual spectral density of
the series has a heavy peak at zero. Since low frequencies correspond to long
periods, such things happen if the series is nearly non-stationary. In such a case,
a periodogram also has a sharp peak near zero. No matter which spectral window
we are using, we will get a not-so-sharp peak at zero (wide though not that high),
so that our estimate will be significantly biased in a neighborhood of zero. Here is
a trick which may help to fight off that bias. It is called differencing. Consider the
increments
Yt = Xt − Xt−1
173

Spectrum
1000

100

10

0.10

Periods
20 8.5 5 3 2

Figure 31. Theoretical and estimated spectrum for an AR(2) pro-


cess Xt − 1.4Xt−1 + 0.98Xt−2 = εt in log scale, Triangular window
with M = 25, sample size N = 2000. You can clearly see leakage
effects.

Spectrum
10

Periods
30 10 5 4 3 2

Figure 32. Theoretical and estimated spectrum for an AR(2) pro-


cess Xt − 1.4Xt−1 + 0.98Xt−2 = εt , Tukey window with M = 25,
sample size N = 2000. Note that, apart from leakage effects, the
Tukey window produces an estimate that is not non-negative.

According to (1.14),
fY (ω) = fX (ω)|1 − e−iω |2 = fX (ω)((1 − cos ω)2 + sin2 ω)
ω
= fX (ω)(2 − 2 cos ω) = fX (ω)4 sin2
2
Since sin2 ω2 has a second order zero at zero, fY should not have any peak at zero.
Hence, fY should be easier to estimate. Finally, we set
fˆY (ω)
fˆX (ω) =
4 sin2 ω2
Differencing could be considered as a special case of pre-whitening. According
to the method, we are looking for a transformation
Yt = α(B)Xt = Xt + a1 Xt−1 + · · · + ak Xt−k
such that the spectrum of Yt is nearly flat (so, Yt is, nearly, a white noise). The
spectrum of Yt should be easy to estimate (since it is flat, there is little or no bias,
we can use small M and therefore have a small variance). However, according to
174

Spectrum
5

0.500

0.050

0.005

Periods
30 10 5 4 3 2

Figure 33. Two estimates for the spectrum of concentration data.


As we can see, they practically coincide everywhere except the low
frequencies. The one with sharp peak at zero is obtained by dif-
ferencing.

(1.14), the spectrum of Xt could be found from the formula


1
fX (ω) = fY (ω)
|α(eiω )|2
Example. For the differencing, we consider the concentration data, shown
on Figure III.15. The sample size is 197. The series itself is on the border of
stationarity. Using Parzen window with M = 10, we get a really wide and flat
peak at the origin. Differencing produces a really sharp one. Away from zero, the
estimates practically coincide. Both estimates are shown on Figure 33.
To illustrate pre-whitening, we again use the concentration data. As it follows
from Figure III.17, the data could probably be described by an AR(2) model. We
find coefficients of the model and consider a series
Yt = (Xt − 17.06) − 0.4263(Xt−1 − 17.06) − 0.2576(Xt−2 − 17.06)
which is nearly a white noise with zero mean. We then estimate the spectrum of
Y using Parzen window with M = 10, and reconstruct the spectrum of X. The
estimate is shown on Figure 34 together with the usual one. Again, estimates prac-
tically coincide away from the origin, and the pre-whitening produces an estimate
that is no so flat around zero.
Integrated spectrum and integrated periodogram. Typically, the inte-
grated spectrum is defined as
Z ω
FX (ω) = 2 fX (θ) dθ
0

or (sometimes) as
Z ω

FX (ω) = fX (θ) dθ
−π
2
In both cases, we get a non-negative increasing function such that F (π) = σX .
2
Dividing by σX , we are getting a normalized integrated spectrum
Z ω
2
H(ω) = 2 fX (θ) dθ
σX 0
For a white noise, H(ω) = ω/π.
175

Spectrum
5

0.500

0.050

0.005

Periods
30 10 5 4 3 2

Figure 34. Pre-whitening at work. As we can see, the estimates


practically coincide everywhere except the low frequencies. The
one with not-so-wide peak at zero is obtained by pre-whitening.
However, differencing (Figure 33) produces a really sharp peak
and that does not look right.

Periodogram
1.0

0.8

0.6

0.4

0.2

Periods
30 10 5 3 2

Figure 35. Integrated periodogram of the human brain data. The


series apparently contains two strictly periodic components, which
are responsible for the jumps.

In fact, integrated spectrum can be defined for processes with discrete spectra
as well, like in the example at the very beginning of Section 4.1; if the spectrum
is continuous, then the integrated spectrum is differentiable. For the discrete spec-
trum, the corresponding integrated spectrum is a step function.
A natural estimate for the (normalized) integrated spectrum is the integrated
periodogram
P
ω ≤ω I(ωp )
Ĥ(ω) = Pp
p I(ωp )

(Note that p I(ωp ) = σ̂ 2 is actually an estimate for the variance of the series.)
P
It could be shown that, under mild assumptions, the integrated periodogram
is an unbiased and consistent estimate for the integrated spectrum.
Integrated periodogram for a white noise. Suppose Xt is a white noise.
Consider
p ω
γ = max( N/2|Ĥ(ω) − |)
ω π
176

Periodogram
1.0

0.8

0.6

0.4

0.2

Periods
80 11 5.5 3 2

Figure 36. Integrated periodogram of the sunspots data. The


line above the diagonal is the 95 % confidence bound.

Note that
ω 1 X
Ĥ(ω) − = 2 (I(ωp ) − 1)
π σ̂X
ωp ≤ω

and random variables I(ωp ) − 1 are independent identically distributed with zero
mean. Hence Ĥ(ω) − ωπ behaves like sums of independent identically distributed
random variables conditioned to get to zero at p = N/2 (which corresponds to π).
Using the advanced tools of stochastic processes (so called Brownian bridge), we
can compute the distribution of γ. Namely,
r ∞
N ωp X 2 2
P {max |Ĥ(ωp ) − | ≤ a} ≈ ∆[2] (a) = (−1)j e−2a j
p 2 π −∞

The function ∆[2] is not easy to compute. However, it had to be done only once,
and somebody did that for us. In particular,

∆[2] (1.36) = .95


∆[2] (1.63) = .99

This way, we get the following way to test if the series is actually a white noise.
We compute γ and compare it with a desired percentile for ∆[2] . If γ exceeds the
percentile, we reject the white noise hypothesis. This test works well in situations
when the spectral density does not have significant peaks, but the total weight of,
say, low frequencies is significantly more than it supposed to be.
Fast Fourier transform. A straightforward computation of the periodogram
requires N 2 multiplications (for each particular frequency, we need 2N operations
Xt eiωt , and we have N/2 principal frequencies.
P
in order to compute
However, suppose that N = rs can be factored. Every t = 0, . . . , N − 1 can be
uniquely represented as t = rt1 +t0 where 0 ≤ t1 ≤ s−1, 0 ≤ t0 ≤ r −1 (actually, t1
is the biggest integer such that t1 ≤ t/r). In a similar way, every p = 0, . . . , N − 1
can be uniquely represented as p = sp1 + p0 where 0 ≤ p1 ≤ r − 1, 0 ≤ p0 ≤ s − 1.
177

Periodogram
1.0

0.8

0.6

0.4

0.2

Periods
30 10 5 3 2

Figure 37. Integrated periodogram of a first difference of the log-


arithm of the crops data. Not a white noise, at least at the 5 %
level of significance.

Now,
N
X N
X −1
d(ωp ) = Xt eiωp t = eiωp Xt+1 eiωp t
t=1 t=0
XX
iωp
=e Xrt1 +t0 +1 e2πip(rt1 +t0 )/N
t0 t1
X X
iωp
=e e2πipt0 /N Xrt1 +t0 +1 e2πiprt1 /N
t0 t1

Note that
e2πiprt1 /N = e2πip1 srt1 /N e2πip0 rt1 /N = e2πip0 rt1 /N
since rs/N = 1 and e2πiN = 1 for an integer N . Therefore
X
a(p0 , t0 ) = Xrt1 +t0 +1 e2πiprt1 /N
t1

depends only on p0 , t0 . Hence


X
d(ωp ) = eiωp e2πipt0 /n a(p0 , t0 )
t0

So we need to compute rs = N values of a(p0 , t0 ), each of them requires s multi-


plications. In addition to that, we need N/2 values of d(ωp ), each of them requires
r multiplications. As a result, the total number of operations becomes (r + s)N
instead of N 2 .
In a special case N = 2k , a number of operations can be reduced to 2kN =
2N log2 N .

Project.
(a) Evaluate (numerically) the spectral bandwidth for each of the following
processes
Xt − 1.8Xt−1 + 0.9Xt−2 = εt
Xt − 3Xt−1 + 4.02Xt−2 − 2.808Xt−3 + 0.864Xt−4 = εt
(1 + 0.028B + 1.06B 2 − 0.11B 3 + 0.21B 4 )Xt = (1 + 0.24B + 0.99B 2 )εt
178

(b) We expect the spectral bandwidth of the process to be about 0.03. We would
like to achieve maxω η 2 (ω) ≤ 0.1 where η 2 (ω) is the mean square percentage error
(4.1). Assuming we are using the Bartlett window, what should be the smallest
sample size? What is the optimal window width?
CHAPTER 5

Filters

1. Classification of Filters
Suppose two series Xt and Yt are related to each other by the formula
X
(1.1) Yt = bl Xt−l ,
l
or, more generally, by the formula
X X
(1.2) ak Yt−k = bl Xt−l
k l
(all sums should contain a finite number of non-zero terms). Treating Xt as an
input and Yt as an output, we call the relation (1.1) (or (1.2)) a filter.
In terms of back shift operator B, we Pcan rewrite (1.1) asPYt = β(B)Xt and
(1.2) as α(B)Yt = β(B)Xt where α(z) = k ak z k and β(z) = l bl z l .
From the practical point of view, (1.1) gives us the output right away whereas
(1.2) is an equation which has to be solved for Yt . Filters of the second type
are called recursive. For the recursive filters, the coefficients ak should vanish for
negative k (the same condition applies to bl though it is less important). Even
more, the polynomial α(z) should satisfy the stationarity condition: all roots of
α(z), complex or real, should be greater than one in absolute value.
P Supposel
Xt and Yt are stationary processes related by (1.1). Denote β(x) =
b
l l x . According to (IV.1.14), their spectral densities are related to each other
by the formula
fY (ω) = |β(eiω )|2 fX (ω)
The factor
2
T (ω) = β(eiω )

(1.3)
is called the transfer function of the filter. If the filter is recursive, that is, if it is
given by (1.2) rather than (1.1), then
β(eiω ) 2

fY (ω) =
fX (ω),
α(eiω )
and the transfer function is equal to
β(eiω ) 2

(1.4) T (ω) =
α(eiω )
A rather typical interpretation of the setup (1.1) and (1.2) is as follows. Suppose
Xt = St + Nt where St is a signal and Nt is a noise. Assume that the noise and
the signal are independent. Also, let us treat St and Nt as stationary processes
with spectral densities fS (ω) and fN (ω). Since S and N are independent, the
autocovariance function RX (k) = Cov(St + Nt , St+k + Nt+k ) = Cov(St , St+k ) +
179
180

Cov(Nt , Nt+k ) = RS (k) + RN (k) is a sum of the autocovariance functions and


therefore
fX (ω) = fS (ω) + fN (ω)
as well. The quotient Rπ
σS2 fS (ω)
2 = R −π
π
σN −π N
f (ω)
is the so-called signal-to-noise ratio. We’d like to have this ratio as large as possible.
Now suppose that the signal is mostly long-periodic, so that the corresponding spectral density nearly vanishes outside of a small neighborhood of zero. On the other hand, it is reasonable to assume that the noise is white (or nearly white), so that its spectral density is a constant. Apply a filter (either (1.1) or (1.2)) to Xt and suppose the filter is designed in such a way that its transfer function T(ω) equals 1 in the neighborhood of zero and vanishes outside of it, so that T(ω)fS(ω) ≈ fS(ω). Then

fY(ω) = T(ω)fS(ω) + T(ω)fN(ω) ≈ fS(ω) + T(ω)fN(ω)

and the variance of the new noise, ∫_{−π}^{π} T(ω)fN(ω) dω, is much smaller than the variance of the original noise.
Classification of the filters. There exist four major types of filters. An ideal
low pass filter is a filter with the transfer function T (ω) = 1 if |ω| ≤ ω0 and T (ω) = 0
otherwise. The point ω0 is called the cut-off point. An ideal high pass filter has
the transfer function T (ω) = 1 if |ω| ≥ ω0 and T (ω) = 0 otherwise (again, ω0 is
called the cut-off point). An ideal band pass filter has a transfer function T (ω) = 1
if ω1 ≤ |ω| ≤ ω2 and T (ω) = 0 otherwise. An ideal band reject filter has a transfer
function T (ω) = 0 if ω1 ≤ |ω| ≤ ω2 and T (ω) = 1 otherwise. Band pass and band
reject filters have two cut-off points. However, ideal filters require infinitely many
past values of Xt and therefore can’t be actually constructed.
With a finite number of values involved, we can only construct an approxima-
tion to an ideal filter. So, if a transfer function of any particular filter can be viewed
as an approximation to, say, an ideal low pass filter with some cut-off point ω0, we call
it a low pass filter with cut-off ω0 , and so on. As an example, consider a filter
(1.5) Yt − 0.943Yt−1 + 0.333Yt−2 = 0.0976(Xt + 2Xt−1 + Xt−2 )
We have α(B) = 1 − 0.943B + 0.333B^2, β(B) = 0.0976(1 + B)^2. The polynomial α satisfies the stationarity condition. We have

|α(e^{iω})|^2 = |1 − 0.943 cos ω + 0.333 cos(2ω) + i(−0.943 sin(ω) + 0.333 sin(2ω))|^2
             = 2.00014 − 2.51404 cos ω + 0.666 cos(2ω).

In a similar way,

|β(e^{iω})|^2 = 0.0976^2 |(1 + cos ω) + i sin ω|^4 = 0.00952576(6 + 8 cos ω + 2 cos(2ω)).

Hence the transfer function equals

T(ω) = 0.00952576(6 + 8 cos ω + 2 cos(2ω)) / (2.00014 − 2.51404 cos ω + 0.666 cos(2ω)).

If we look at the graph of this function, we see that it may be considered as an approximation to an ideal low pass filter. Its cut-off point should be somewhere around π/4 (see Figure 1).
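For readers who want to reproduce this computation, here is a minimal numerical sketch (the helper name is invented for this example) that evaluates the transfer function of the filter (1.5) on a grid and locates its half-power point.

```python
import numpy as np
from numpy.polynomial import polynomial as P

# Filter (1.5): alpha(B) Y_t = beta(B) X_t
a = np.array([1.0, -0.943, 0.333])         # coefficients of alpha(z)
b = 0.0976 * np.array([1.0, 2.0, 1.0])     # coefficients of beta(z) = 0.0976 (1 + z)^2

def transfer(num, den, w):
    """T(w) = |beta(e^{iw})|^2 / |alpha(e^{iw})|^2 at the frequencies w."""
    z = np.exp(1j * w)
    return np.abs(P.polyval(z, num))**2 / np.abs(P.polyval(z, den))**2

w = np.linspace(0.0, np.pi, 1000)
T = transfer(b, a, w)
print(T[0])                                # approximately 1 at w = 0
print(w[np.argmin(np.abs(T - 0.5))])       # half-power point, close to pi/4 ~ 0.785
```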

Figure 1. Transfer function of the filter (1.5). Compare it with


the transfer function of an ideal low pass filter with cut-off π/4.


Figure 2. Transfer function of the filter (1.6). We see that the


filter suppresses low frequencies and amplifies high frequencies. So,
it can’t be called a high pass filter (however, it can be made into
a high pass filter if we slightly modify the equation).

On the other hand, consider now a filter

(1.6) Yt = Xt − 2Xt−1 + Xt−2

Computation reveals
T (ω) = 4(1 − cos ω)2 .

Graph of this function is shown on Figure 2. As we can see, the filter (1.6) suppresses
low frequencies and amplifies high frequencies. This function does not approximate
a transfer function of any of the ideal filters, so it is not a filter of any of the above
types.
Moving Average as a filter. We have seen the relation of the type (1.1)
when we have discussed the method of moving averages in Section I.2. According
to (I.2.1), an estimate fˆ(t) for the trend was constructed as

f̂(t) = Yt = (1/(2l + 1)) Σ_{k=−l}^{l} X_{t−k}

Figure 3. Transfer function of the moving averages with l = 5.


Figure 4. Transfer function of the moving averages with l = 15.

We can consider this relation as a filter (1.1). Let us find its transfer function. The corresponding function β(x) equals

β(x) = (1/(2l + 1)) Σ_{k=−l}^{l} x^k = (1/(2l + 1)) x^{−l}(1 + x + x^2 + · · · + x^{2l}) = (1/(2l + 1)) x^{−l} (1 − x^{2l+1})/(1 − x).

So,

(1.7)    T(ω) = (1/(2l + 1)^2) |e^{−ilω}|^2 |1 − e^{(2l+1)iω}|^2 / |1 − e^{iω}|^2.

Now, for a real number z, |e^{iz}| = 1 and

|1 − e^{iz}|^2 = (1 − cos z)^2 + (−sin z)^2 = 2 − 2 cos z.
For these reasons, (1.7) boils down to

(1.8)    T(ω) = (1/(2l + 1)^2) · (1 − cos((2l + 1)ω)) / (1 − cos(ω)).

L'Hôpital's rule implies that T(0) = 1. On the other hand, T(π) = 1/(2l + 1)^2 is close

to zero if l is large enough. If we look at the graph of this function, we clearly


see that it is equal to 1 at zero, then it drops to zero and it stays practically zero
after that (see Figures 3 and 4 for l = 5 and l = 15). So, we can consider

Figure 5. We should be in full control of the cut-off point ω0 as


well as of the quality of filtration, like on this picture ...

them as an approximation for the low pass filter. However, it is not quite clear what we should call the cut-off point of this filter. For instance, we may notice
that T (π/(2l + 1)) = 0 and that is the smallest positive zero of this function. So,
ω0 = π/(4l + 2) could be considered as a cut-off point. The value of the transfer
function at ω0 is approximately equal to 2/π 2 . However, we have very little control
over the cut-off point, and we have no means of making the transfer function approximate the "ideal" 0-1 transfer function of a low pass filter. We should
look for better options. Namely, we definitely need full control of the cut-off point.
Also, we should be able to approximate the transfer function of the ideal filter as
well as we want, something like shown on Figure 5.
First order low pass sine filter. Suppose
(1.9) Yt − aYt−1 = (1 − a)Xt
where a > 0 is a parameter. Since (1.9) defines a recursive filter, the stationarity
condition implies that a < 1. According to (1.4),

fY(ω) = |(1 − a)/(1 − a cos ω − ia sin ω)|^2 fX(ω) = ((1 − a)^2 / (1 + a^2 − 2a cos(ω))) fX(ω)

and therefore

T(ω) = (1 − a)^2 / (1 + a^2 − 2a cos(ω))
Note that 1 + a^2 − 2a cos ω ≥ 1 + a^2 − 2a = (1 − a)^2 and therefore T(ω) ≤ 1. Since 1 − cos ω = 2 sin^2(ω/2), T(ω) can be rewritten as

(1.10)    T(ω) = 1 / (1 + (4a/(1 − a)^2) sin^2(ω/2))
As we can see, T(0) = 1 and T(π) = (1 − a)^2/(1 + a)^2. The function T(ω) is decreasing, and its shape
depends on the parameter a. Let us find a point ω0 such that

T(ω0) = 1/2,

which is equivalent to the equation sin^2(ω0/2) = (1 − a)^2/(4a). If a > 3 − 2√2 ≈ 0.172, the right side is less than 1 and we can solve for ω0. We get

(1.11)    ω0 = 2 arcsin((1 − a)/(2√a))

In terms of ω0, (1.10) can be rewritten as

(1.12)    T(ω) = 1 / (1 + (sin(ω/2)/sin(ω0/2))^2)

The filter (1.12) is the simplest representative of the family of so-called Butterworth sine filters. A sine low pass filter of order n is a filter with the transfer function

(1.13)    T(ω) = 1 / (1 + (sin(ω/2)/sin(ω0/2))^{2n})

As we can see, the function T is decreasing, T (0) = 1 and T (ω0 ) = 1/2. Let us find
out what happens to T (ω) if we increase n, say, send it to infinity. If 0 < ω < ω0 ,
then sin(ω/2) < sin(ω0/2) and therefore

(sin(ω/2)/sin(ω0/2))^{2n} → 0

as n → ∞. If, however, ω0 < ω < π, then sin(ω/2) > sin(ω0/2) and therefore

(sin(ω/2)/sin(ω0/2))^{2n} → ∞
Therefore T (ω) → 1 if ω < ω0 and T (ω) → 0 if ω > ω0 , that is, T (ω) approximates
the transfer function of the ideal low pass filter with cut-off point ω0 . The quality
of approximation depends on the order of the filter. If we graph transfer functions
of sine filters of different orders with the same ω0 , we get a picture similar to Figure
5. For the above reasons, the point ω0 is called the cut-off point of the sine filter.
Solving (1.11) for a, we can construct a first order filter with given cut-off point ω0 .
However, designing an n-th order filter is another story.
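As an illustration, here is a minimal Python sketch (the function name is invented) that chooses the parameter a of the first order sine filter (1.9) for a prescribed cut-off point via (1.11); for 0 < ω0 < π the solution automatically satisfies a > 3 − 2√2.

```python
import numpy as np

def sine_lowpass_a(omega0):
    """Parameter a of the first order sine filter (1.9) whose half-power
    point (1.11) equals omega0 (any 0 < omega0 < pi)."""
    s = np.sin(omega0 / 2)
    return (np.sqrt(s**2 + 1) - s)**2       # solves (1 - a)/(2 sqrt(a)) = sin(omega0/2)

a = sine_lowpass_a(np.pi / 6)
T = lambda w: (1 - a)**2 / (1 + a**2 - 2*a*np.cos(w))
print(a, T(np.pi / 6))                      # T(omega0) = 1/2 by construction
```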
If we begin with the equation
(1.14) Yt + aYt−1 = (1 − a)Xt
instead of (1.9), we get a high pass filter with transfer function

T(ω) = 1 / (1 + (4a/(1 − a)^2) cos^2(ω/2)),

the so-called Butterworth cosine filter of order 1. Its cut-off point can be found from the equation cos^2(ω0/2) = (1 − a)^2/(4a), which again has a solution if a > 3 − 2√2.
Instead of sine and cosine filters, we will focus on so called Butterworth tangent
filters which are much easier to design. A tangent low pass filter of the order n is
a filter with the transfer function

(1.15)    T(ω) = 1 / (1 + (tan(ω/2)/tan(ω0/2))^{2n})

Once again, we can see that T (ω) is decreasing, T (0) = 1 and T (ω0 ) = 1/2. Also,
if n → ∞, then T (ω) → 1 if ω < ω0 and T (ω) → 0 if ω > ω0 , same as for the sine
filters.
There is one important difference between sine/cosine filters and tangent filters.
Sine filter of the order n has the structure
α(B)Yt = Xt

Figure 6. Transfer functions of the low pass sine and tangent


filters of the order 5, cut-off point is π/10. Can you see any differ-
ence?


Figure 7. Transfer functions of the low pass sine and tangent


filters of the order 5 in log scale. Now we can see a difference at
high frequencies, tangent filter works better (its transfer function
is the lower one).

where the polynomial α(B) has the degree n. Tangent filter of the order n has the
structure
α(B)Yt = β(B)Xt
where both polynomials α(B) and β(B) have the degree n.
At low frequencies, the transfer functions of the sine and tangent filters practically coincide. However, there is a difference at high frequencies. For a sine filter,

T(π) = 1 / (1 + (1/sin(ω0/2))^{2n}),
though for a tangent filter T (π) = 0. So, tangent filters do a better job eliminating
undesired highly oscillating components. On Figure 6, you can see transfer functions
for the sine filter and for the tangent filter, both of them of the order 5, with
the same cut-off point π/10. The difference between them can be clearly seen in
logarithmic scale (Figure 7).
Designing filters with given transfer function will be discussed in the next sec-
tion.

Exercises

For the following filters, find their transfer functions and graph them. Decide whether the filter could be called a low pass filter, a high pass filter, a band pass filter, or a band reject filter. If so, what is (approximately) the cut-off point (or points)?
1. Yt = 0.25Xt−1 + 0.5Xt + 0.25Xt+1 .
2. Yt − 0.5Yt−1 = Xt + Xt−1 .
3. Yt = 0.5(Xt − Xt−1 ).
4. Yt = 0.25Xt − 0.5Xt−1 + 0.25Xt−2 .
5. Yt − 0.727Yt−1 = 0.137(Xt + Xt−1 )
6. Yt = 0.5(Xt + Xt−1 )
7. Yt − 1.561Yt−1 + 0.641Yt−2 = 0.02(Xt + 2Xt−1 + Xt−2 )
8. 3.414Yt + 0.586Yt−2 = Xt + 2Xt−1 + Xt−2
9. Yt + 0.414Yt−1 = 0.293(Xt − Xt−1 )
10. Yt + 0.577Yt−1 = 0.211(Xt − Xt−1 )
11. Yt + 0.943Yt−1 + 0.333Yt−2 = 0.098(Xt − 2Xt−1 + Xt−2 )
12. Yt + 1.28Yt−1 + 0.478Yt−2 = 0.0495(Xt − 2Xt−1 + Xt−2 )
13. 2Yt − 1.414Yt−1 = Xt − Xt−2
14. 2Yt − 1.414Yt−1 = Xt − 1.414Xt−1 + Xt−2
15. Yt − 0.607Yt−1 + 0.51Yt−2 = 0.245(Xt − Xt−2 )
16. Yt − 0.607Yt−1 + 0.51Yt−2 = 0.755(Xt − 0.805Xt−1 + Xt−2 )
17. Yt + 1.051Yt−1 + 0.649Yt−2 = 0.175(Xt − Xt−2 )
18. Yt + 1.051Yt−1 + 0.649Yt−2 = 0.825Xt + 1.051Xt−1 + 0.825Xt−2

2. Construction of the filters


We begin with the following problem. Let

α(B)Yt = β(B)Xt

be a filter and let


T(ω) = |β(e^{iω}) / α(e^{iω})|^2
be its transfer function. How could we recover a filter from its transfer function?
Denote by n and m the degrees of the polynomials α and β. The polynomials α and β can be factorized:

α(z) = a0 + a1 z + · · · + an z^n = an (z − z1) · · · (z − zn),
β(z) = b0 + b1 z + · · · + bm z^m = bm (z − w1) · · · (z − wm).

So, the roots of the polynomials define the filter up to unknown quotient an /bm (a
proportional change of all of the coefficients does not change the filter). However,
the maximal value of the transfer function should be equal to one. Depending on
the type of the filter, this leads to a normalization condition which allows us to find
the value |an /bm |. For instance, for the low pass filters, as well as for band reject
filters, T (0) = 1, which is equivalent to the property |α(1)| = |β(1)|. (In fact, it is
quite natural to assume that, if the input is a constant, then the output must be
the same constant, which means α(1) = β(1), without absolute values). For the
high-pass filters, T (π) = 1, which is equivalent to |α(−1)| = |β(−1)|. For the band
pass filters, we actually have to find a point where T (ω) reaches its maximum, and
set that value to one.

Denote G(z) = β(z)/α(z). Let us consider T(ω) = T̃(e^{iω}) as a function on the unit circle. On the unit circle,

T̃(e^{iω}) = |G(e^{iω})|^2 = G(e^{iω}) G(e^{−iω}) = G(e^{iω}) G(1/e^{iω}).
Hence the function T̃ (z) coincides with an analytic function G(z)G(1/z) which is
therefore an analytic continuation of T̃ (z). By the properties of analytic functions,
an analytic continuation is unique (if it exists at all).
Now, α(z) must satisfy the stationarity condition. Therefore its roots z1 , . . . , zn
must be outside of the unit circle. In order to find the roots of α(z), let us find the
poles of T̃ (z), that is the roots of T̃ −1 (z). Clearly, T̃ −1 (z) = G−1 (z)G−1 (1/z) = 0
if and only if α(z) = 0 or α(1/z) = 0. Hence T̃ (z) has 2n poles z1 , . . . , zn and
z1−1 , . . . , zn−1 , of which the first n are outside of the unit circle and the other n are
inside (so we can easily decide which roots are to be used).
Let us now find zeroes of T̃ (z) = G(z)G(1/z). Clearly, G(z)G(1/z) = 0 if and
only if β(z) = 0 or β(1/z) = 0. So T̃ (z) has 2m zeros, namely w1 , . . . , wm and
w1^{−1}, . . . , wm^{−1}. So, the roots come in pairs w1, w1^{−1}, . . . , wm, wm^{−1}. Out of each pair,
only one root is to be used. It looks like we have some flexibility here. However, in
most cases, wi = ±1.
Hence, in order to implement a filter with a given transfer function T (ω), we
should do the following.
Step 1. Consider T (ω) as a function on the unit circle and find its analytic
continuation T̃ (z) (it is unique if exists at all).
Step 2. Find all the roots and all the poles of T̃ (z). Choose those poles that
are outside the unit circle. Select half of the zeroes to be used.
Step 3. Find the quotient an /bm from the normalization condition.
Low pass Tangent Filters. Practical realization of this program for the
Butterworth low pass tangent filters is not so difficult. We begin with the formula
tan(ω/2) = (1/i) (e^{iω/2} − e^{−iω/2})/(e^{iω/2} + e^{−iω/2}) = (1/i) (e^{iω} − 1)/(e^{iω} + 1) = i (1 − e^{iω})/(1 + e^{iω}).

Denote A = tan(ω0/2). Replacing e^{iω} by z, we easily get an analytic continuation for T̃(z):

T̃(z) = A^{2n} / (A^{2n} + (−1)^n ((1 − z)/(1 + z))^{2n}).

Hence, its poles can be obtained from the equation

(2.1)    ((1 − z)/(1 + z))^{2n} = (−1)^{n+1} A^{2n},
which can be easily solved if we know enough of complex variables (we have to use
the so-called roots of unity for that, see appendix C). It could be shown that it has
2n roots, half of them in the upper half plane (and those are outside of the unit
circle) and the other half is in the lower half plane and inside the unit circle. So
we can easily decide which roots are to be used. Zeroes of T̃ (z) are even easier,
T̃ (z) = 0 if and only if z = −1 but this root has multiplicity 2n so we have to use
n of them.

First order low pass tangent filter. We have n = 1 and the equation (2.1)
implies
((1 − z)/(1 + z))^2 = A^2.

Assume for now that ω0 ≠ π/2, so that A ≠ 1. We have

(1 − z)/(1 + z) = ±A,    z = (1 ∓ A)/(1 ± A).

Now, A = tan(ω0/2) is positive and therefore the root z1 = (1 + A)/(1 − A) is outside the unit circle. Hence

α(z) = a1(z − z1),    β(z) = b1(1 + z).

Finally, the normalization condition yields

a1(1 − z1) = 2b1,    b1/a1 = (1 − z1)/2 = A/(A − 1).

Setting a1 = 1, we get b1 = A/(A − 1), so the filter should be

−((1 + A)/(1 − A)) Yt + Yt−1 = (A/(A − 1)) (Xt + Xt−1)
or
(2.2) (A + 1)Yt + (A − 1)Yt−1 = A(Xt + Xt−1 )
This formula still works if ω0 = π/2 and A = 1, though the filter becomes non-
recursive.
The other way around, suppose the filter is given by (2.2). Then

|α(e^{iω})|^2 = |(A + 1) + (A − 1) cos ω + i(A − 1) sin ω|^2
             = (A + 1)^2 + 2(A + 1)(A − 1) cos ω + (A − 1)^2 cos^2 ω + (A − 1)^2 sin^2 ω
             = 2A^2 + 2 + 2(A^2 − 1) cos ω = 2(A^2(1 + cos ω) + (1 − cos ω)).

In a similar way,

|β(e^{iω})|^2 = |A((1 + cos ω) + i sin ω)|^2 = A^2(1 + 2 cos ω + cos^2 ω + sin^2 ω) = 2A^2(1 + cos ω).

Hence the transfer function

T(ω) = A^2(1 + cos ω) / (A^2(1 + cos ω) + (1 − cos ω)) = 1 / (1 + (1/A^2)·(1 − cos ω)/(1 + cos ω)) = 1 / (1 + tan^2(ω/2)/A^2)
Example. Let ω0 = π/6. Then A = tan(π/12) ≈ 0.268 and (2.2) becomes
1.268Yt − 0.732Yt−1 = 0.268(Xt + Xt−1 ).
If we wish, we can divide all coefficients by the first one, and get the equation
Yt − 0.577Yt−1 = 0.211(Xt + Xt−1 ).
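The computation of the coefficients from a given cut-off point is easy to automate. A minimal Python sketch (the function name is invented):

```python
import numpy as np

def tangent_lowpass_order1(omega0):
    """Coefficients of the first order low pass tangent filter (2.2),
    (A+1) Y_t + (A-1) Y_{t-1} = A (X_t + X_{t-1}),  A = tan(omega0/2),
    returned with the leading coefficient normalized to 1."""
    A = np.tan(omega0 / 2)
    alpha = np.array([A + 1, A - 1])
    beta = np.array([A, A])
    return alpha / alpha[0], beta / alpha[0]

print(tangent_lowpass_order1(np.pi / 6))
# (~[1, -0.577], ~[0.211, 0.211]) -- the filter of the example above
```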
Second order low pass tangent filter. The equation (2.1) becomes
((1 − z)/(1 + z))^4 = −A^4

and therefore

(1 − z)/(1 + z) = A(±√2/2 ± i√2/2)

(all four combinations of signs are possible). We need to choose two of them. As we know, if w = (1 − z)/(1 + z), then z = (1 − w)/(1 + w), and |z| > 1 if and only if |w − 1| > |w + 1|, that is, if and only if w belongs to the left half plane. With this in mind, we can find

z1,2 = (1 − A^2 ± √2 A i) / (1 − √2 A + A^2).

Note that

(1 − √2 A + A^2)(1 + √2 A + A^2) = 1 + A^4.

Therefore

α(z) = a2(z − z1)(z − z2) = a2(z^2 − (z1 + z2)z + z1 z2)
     = a2( z^2 − (2(1 − A^2)/(1 − √2 A + A^2)) z + (1 + A^4)/(1 − √2 A + A^2)^2 )
     = a2( z^2 − (2(1 − A^2)/(1 − √2 A + A^2)) z + (1 + √2 A + A^2)/(1 − √2 A + A^2) )
Next,

β(z) = b2(z + 1)^2 = b2(z^2 + 2z + 1)

and we can find the quotient a2/b2 from the normalization condition α(1) = β(1). We have β(1) = 4b2 and

α(1) = a2( 1 − 2(1 − A^2)/(1 − √2 A + A^2) + (1 + A^4)/(1 − √2 A + A^2)^2 )
     = a2 ( (1 − √2 A + A^2)^2 − 2(1 − A^2)(1 − √2 A + A^2) + 1 + A^4 ) / (1 − √2 A + A^2)^2
     = a2 (4A^2 − 4√2 A^3 + 4A^4) / (1 − √2 A + A^2)^2 = 4a2 A^2 / (1 − √2 A + A^2).
After all cancellations, the normalization condition α(1) = β(1) reads

A^2 a2 = (1 − √2 A + A^2) b2.

Choosing a2 = 1 − √2 A + A^2 and b2 = A^2, we end up with the following expression:

(2.3)    (1 + √2 A + A^2)Yt − 2(1 − A^2)Yt−1 + (1 − √2 A + A^2)Yt−2 = A^2(Xt + 2Xt−1 + Xt−2).

Example. Let ω0 = π/3. Then A = tan(π/6) = 1/√3 ≈ 0.577 and (2.3) becomes

(4/3 + √(2/3)) Yt − (4/3) Yt−1 + (4/3 − √(2/3)) Yt−2 = (1/3)(Xt + 2Xt−1 + Xt−2),

or, approximately,

2.15Yt − 1.333Yt−1 + 0.517Yt−2 = 0.333(Xt + 2Xt−1 + Xt−2).
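The same can be done for the second order filter (2.3); a short sketch, again with an invented function name:

```python
import numpy as np

def tangent_lowpass_order2(omega0):
    """Coefficients (alpha, beta) of the 2nd order low pass tangent filter (2.3)."""
    A = np.tan(omega0 / 2)
    alpha = np.array([1 + np.sqrt(2)*A + A**2, -2*(1 - A**2), 1 - np.sqrt(2)*A + A**2])
    beta = (A**2) * np.array([1.0, 2.0, 1.0])
    return alpha, beta

print(tangent_lowpass_order2(np.pi / 3))
# (~[2.15, -1.333, 0.517], ~[0.333, 0.667, 0.333]), matching the example above
```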
Role of the order of the filter. Order of the filter determines the shape
of the transfer function, and therefore it determines the quality of a filtration. To
illustrate the difference between the filters of different order, we apply them to the
(simulated) data (Figure 8) of the form Xt = St + εt where the signal St = 1 if t
belongs to one of the intervals [101, 200], [280, 300] or [380, 400], and otherwise it is equal
to zero, and the noise εt is a white noise with expectation zero and variance σ 2 = 1.
For instance, we can think of a signal as a dash-dot-dot signal, which corresponds

Figure 8. Signal plus noise data Xt = St + εt . The signal is


shown as a pink line, St = 1 if t belongs to one of the intervals
[101, 200], [280, 300] or [380, 400], and otherwise it is equal to zero. It
may stand for a character “d” in Morse dash-dot code. The noise
εt is a stationary process with variance 1.


Figure 9. Low pass tangent filter of the order 2, with the same
cut-off point π/30. The curve looks just a bit more smooth than
the sine filter results.


Figure 10. Low pass tangent filter of the order 4, with the same
cut-off point π/30.

to the character “d” in Morse dash-dot code. We can treat it as a low frequency
signal, and apply low pass filters to the data. On Figures 9-11, we can see tangent
filters of the order 2, 4 and 6 with cut-off point ω0 = π/30 in action. Figure 12
explains how the cut-off point was chosen.
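The experiment behind Figures 8 and 9 can be imitated with a short simulation. This is only a sketch, with an arbitrary random seed and Gaussian white noise, so the curves will differ slightly from the figures.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
signal = np.zeros(n)
for lo, hi in [(101, 200), (280, 300), (380, 400)]:   # the "dash-dot-dot" signal of Figure 8
    signal[lo - 1:hi] = 1.0
x = signal + rng.standard_normal(n)                   # add white noise with variance 1

# 2nd order low pass tangent filter (2.3) with cut-off pi/30, run forward in time
A = np.tan(np.pi / 60)
alpha = np.array([1 + np.sqrt(2)*A + A**2, -2*(1 - A**2), 1 - np.sqrt(2)*A + A**2])
beta = (A**2) * np.array([1.0, 2.0, 1.0])
y = np.zeros(n)
for t in range(n):
    acc = sum(beta[l] * x[t - l] for l in range(3) if t - l >= 0)
    acc -= sum(alpha[k] * y[t - k] for k in range(1, 3) if t - k >= 0)
    y[t] = acc / alpha[0]
# y is a smoothed (and slightly delayed) version of the signal, as in Figure 9
```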

Figure 11. Low pass tangent filter of the order 6, with the same
cut-off point π/30.


Figure 12. How to choose the cut-off point? Why did we choose
π/30? We have looked at the periodogram, decided about the band
that contains the signal (low frequency signal in this example, the
band is (0, 0.07) or so) and then choose the cut-off point with some
room to spare. Don’t make the band way too narrow, better let a
bit of noise in.

High pass tangent filters. In a similar way, we could construct other types
of filters. A high pass tangent filter of the order n with the cut-off frequency ω0 is
a filter with the transfer function

T(ω) = 1 / (1 + (cot(ω/2)/cot(ω0/2))^{2n})

(similar to the low pass filter, we have T(π) = 1, T(0) = 0 and T(ω0) = 1/2).
Construction of the high pass filter begins with the identity

cot(ω/2) = i (e^{iω} + 1)/(e^{iω} − 1).

Denote C = cot(ω0/2). Replacing e^{iω} by z, we get the following analytic continuation for T̃(z):

T̃(z) = C^{2n} / (C^{2n} + (−1)^n ((z + 1)/(z − 1))^{2n})

Therefore, the poles of T̃ satisfy the equation

(2.4)    ((z + 1)/(z − 1))^{2n} = (−1)^{n+1} C^{2n},

and all zeroes of T̃ are equal to 1.
First order high pass tangent filter. Similar to first order low pass filters,
assume that ω0 ≠ π/2 and C ≠ 1. For the first order filter, the equation (2.4) implies

(z + 1)/(z − 1) = ±C.

Since C > 0, the only root outside the unit circle is equal to z1 = (C + 1)/(C − 1). Therefore
α(z) = a1 (z − z1 ), β(z) = b1 (z − 1)
For the high pass filters, the normalization condition reads α(−1) = β(−1) (if
Xt = (−1)t is an oscillation with the frequency π, then Yt is the same). This way
we arrive at the filter
(2.5) (C + 1)Yt − (C − 1)Yt−1 = C(Xt − Xt−1 ).
The formula (2.5) still works if ω0 = π/2 and C = 1.
Example. Let ω0 = 9π/10. Then C = cot(9π/20) ≈ 0.158 and (2.5) becomes

1.158Yt + 0.842Yt−1 = 0.158(Xt − Xt−1).
Second order high pass tangent filter. This time, (2.4) implies
((z + 1)/(z − 1))^4 = −C^4

and therefore

(z + 1)/(z − 1) = C(±√2/2 ± i√2/2).

As we know, if w = (z + 1)/(z − 1), then z = (w + 1)/(w − 1), and |z| > 1 if and only if |w + 1| > |w − 1|, that is, if and only if w belongs to the right half plane. Hence, we can find

z1,2 = (w + 1)/(w − 1)    with    w = C(√2/2 ± i√2/2).

This way we get

z1,2 = (C^2 − 1 ± √2 C i) / (1 − √2 C + C^2)
and

α(z) = a2(z − z1)(z − z2) = a2(z^2 − (z1 + z2)z + z1 z2)
     = a2( z^2 − (2(C^2 − 1)/(1 − √2 C + C^2)) z + (1 + C^4)/(1 − √2 C + C^2)^2 )
     = a2( z^2 − (2(C^2 − 1)/(1 − √2 C + C^2)) z + (1 + √2 C + C^2)/(1 − √2 C + C^2) )

and

β(z) = b2(z − 1)^2 = b2(z^2 − 2z + 1)

and we can find a2 /b2 from the normalization condition α(−1) = β(−1). After all
transformations, we get the same equation

C 2 a2 = (1 − 2C + C 2 )b2 .
as in the case of the second order low pass filter (no surprise,
√ compare the expres-
sions for α(z) and β(z) in both cases). Choosing a2 = 1 − 2C + C 2 and b2 = C 2 ,
we end up with the following expression:
√ √
(1 + 2C + C 2 )Yt − 2(C 2 − 1)Yt−1 + (1 − 2C + C 2 )Yt−2
(2.6)
= C 2 (Xt − 2Xt−1 + Xt−2 )

Example. Let ω0 = 5π/6. Then C = cot(5π/12) ≈ 0.268 and (2.6) becomes

1.451Yt + 1.856Yt−1 + 0.692Yt−2 = 0.0718(Xt − 2Xt−1 + Xt−2).
Band pass and band reject filters have two cut-off points. Traditionally,
they are parameterized by the center of the band ωc and the bandwidth 2B (so the
cut-off points are ωc ± B). A band pass filter of order n has the transfer function

(2.7)    T(ω) = 1 / (1 + ((cos(ω) − cos(ωc) sec(B)) / (tan(B) sin(ω)))^n)

where n must be even. It could be shown that T(ωc ± B) = 1/2, T(0) = T(π) = 0, and the maximal value of T(ω) is equal to one, though the corresponding frequency ωmax differs from the center of the band ωc. Namely, ωmax could be found from the equation

cos(ωmax) = cos(ωc) sec(B).

A band reject filter could be obtained from here by taking

(2.8)    T(ω) = 1 − 1 / (1 + ((cos(ω) − cos(ωc) sec(B)) / (tan(B) sin(ω)))^n)

Second order band pass tangent filter. Let the transfer function of the
filter be given by (2.7) with n = 2:
(2.9)    T(ω) = 1 / (1 + ((cos(ω) − cos(ωc) sec(B)) / (tan(B) sin(ω)))^2)

It is natural to assume that 0 < ωc ± B < π (the band is strictly inside the interval [0, π]). In addition, we assume that B ≠ π/4.
Denote

D = cos(ωmax) = cos(ωc) sec(B) = cos(ωc)/cos(B),    E = tan B.

Clearly, 0 < B < ωc < π − B and therefore cos(B) > cos(ωc) > cos(π − B) = −cos(B). For this reason, |D| < 1. Also, E ≠ 1 by assumption. We have

T(ω) = 1 / (1 + ((cos(ω) − D)/(E sin(ω)))^2)

We are going to use the procedure described above, that is:


1. Consider T (ω) as a function on the unit circle and find its analytic continu-
ation T̃ (z).

2. Find all the roots and all the poles of T̃ (z). The function would have four
roots w1 , 1/w1 , w2 , 1/w2 and four poles z1 , 1/z1 , z2 , 1/z2 . Take those poles that are
outside of the unit circle (we denote them by z1 and z2 ). Select half of the zeroes
to be used (one out of each pair; denote them by w1 , w2 ). Get
α(z) = a2 (z − z1 )(z − z2 ), β(z) = b2 (z − w1 )(z − w2 )
3. Find the quotient a2 /b2 from the normalization condition (maximal value of
T (ω) should be equal to 1).
To begin with, recall that

cos(ω) = (e^{iω} + e^{−iω})/2,    sin(ω) = (e^{iω} − e^{−iω})/(2i).

Replacing e^{iω} by z and e^{−iω} by 1/z, we get the following formula for T̃(z):

(2.10)    T̃(z) = 1 / (1 + (i (z + 1/z − 2D)/(E(z − 1/z)))^2) = 1 / (1 − (z^2 − 2Dz + 1)^2/(E^2 (z^2 − 1)^2))

Let us find the roots and the poles of the function T̃ (z).
Clearly, T̃ (z) = 0 if and only if (z 2 − 1)2 = 0, that is, if z = ±1. Each of those
roots has multiplicity 2. According to the procedure, we should take one of each of
them. So, w1 = 1, w2 = −1 and therefore
β(z) = b2 (z 2 − 1)
Now, we have the following equation for the poles:

(z^2 − 2Dz + 1)^2 / (E^2 (z^2 − 1)^2) = 1,
which is equivalent to the equation
z 2 − 2Dz + 1 = ±E(z 2 − 1)
So, the poles of the function T̃ (z) could be found from the following two quadratic
equations
(2.11) (1 + E)z 2 − 2Dz + (1 − E) = 0
or
(2.12) (1 − E)z 2 − 2Dz + (1 + E) = 0
The equations have real roots if D2 + E 2 ≥ 1, otherwise all the roots are complex.
But, which roots are outside of the unit circle?
First of all, note that if z satisfies (2.11), then 1/z satisfies (2.12).
Suppose that D2 + E 2 < 1, so the roots are imaginary. Then, for each of
the equations, the roots are conjugate to each other, they have the same absolute
value, so their product is equal to the square of their absolute value. However, the
product of the roots can be easily found from the coefficients. Hence, the square of
the absolute value of the roots is equal to (1 − E)/(1 + E) for (2.11) and it is equal
to (1 + E)/(1 − E) for (2.12). Since 0 < E < 1 in this case, (2.12) is the equation
with the roots outside of the unit circle.
Let now D2 + E 2 ≥ 1, so the roots are real. Let us show that (2.11) has both
roots within the interval [−1, 1]. Indeed, p(z) = (1 + E)z^2 − 2Dz + (1 − E) > 0 at 1 and −1 because |D| < 1. Also, p(z) reaches its minimum at z0 = D/(1 + E). Since |D| < 1 and E > 0, |z0| < 1. However, the value p(z0) = (1 − E^2 − D^2)/(1 + E) ≤ 0. So,

it is again the equation (2.12) that has the roots bigger than one in absolute value.
(If B = π/4 and E = 1 , then the equation (2.12) degenerates into the first order
equation. However, its root 1/D is still bigger than one in absolute value, and both
of the roots of (2.11) (which are 0 and D) are inside [−1, 1].)
From here, it follows that z1 z2 = (1 + E)/(1 − E), z1 + z2 = 2D/(1 − E) and
therefore
α(z) = a2( z^2 − (2D/(1 − E)) z + (1 + E)/(1 − E) )
It remains to find the quotient a2/b2. As we can see from (2.9), T(ω) = 1 if ω = ωmax = arccos(D). So, we should have T̃(z) = 1 if z = zmax = e^{iωmax}. However,

zmax = cos(ωmax) + i sin(ωmax) = D + i√(1 − D^2)

(since 0 < ωmax < π, sin(ωmax) > 0). Hence,

zmax^2 = 2D^2 − 1 + 2iD√(1 − D^2)

and

α(zmax) = a2( 2D^2 − 1 + 2iD√(1 − D^2) − (2D/(1 − E))(D + i√(1 − D^2)) + (1 + E)/(1 − E) )
        = a2( 2(D^2 − 1)·E/(E − 1) + i·2D√(1 − D^2)·E/(E − 1) )
        = 2a2 (E/(E − 1)) ( (D^2 − 1) + iD√(1 − D^2) ).

In a similar way,

β(zmax) = b2(2D^2 − 2 + 2iD√(1 − D^2)) = 2b2((D^2 − 1) + iD√(1 − D^2))

and therefore

β(zmax)/α(zmax) = (b2/a2)·(E − 1)/E
For instance, we can take b2 = E and a2 = E − 1. Hence, the formula for the filter
becomes
(2.13) (E + 1)Yt − 2DYt−1 + (1 − E)Yt−2 = EXt − EXt−2 .
Second order band reject tangent filter. With band pass filter done, band
reject filter is easy. Since the transfer function of the band reject filter is one minus
the transfer function of the band pass filter, we immediately get from (2.10)

T̃(z) = 1 − 1 / (1 − (z^2 − 2Dz + 1)^2/(E^2(z^2 − 1)^2)).

Moreover, T̃(z) has the same poles as the function (2.10) and, as above,

α(z) = a2( z^2 − (2D/(1 − E)) z + (1 + E)/(1 − E) ).

Next, T̃(z) = 0 if and only if z^2 − 2Dz + 1 = 0, that is, if z = D ± i√(1 − D^2) (and each of the roots actually has multiplicity two). We need to choose two of them and, for the first time, it looks like we have some freedom. However, the only way to get a filter with real coefficients is to take w1 = D + i√(1 − D^2) and w2 = D − i√(1 − D^2). Therefore

β(z) = b2(z^2 − 2Dz + 1)

The quotient a2/b2 could be found from the condition T(0) = T(π) = 1, which is equivalent to

α(1) = β(1),    α(−1) = β(−1).

However,

α(1) = 2a2(1 − D)/(1 − E),    β(1) = 2b2(1 − D),
α(−1) = 2a2(1 + D)/(1 − E),    β(−1) = 2b2(1 + D),

and we can take a2 = 1 − E, b2 = 1. Hence, the formula for the filter becomes
(2.14) (E + 1)Yt − 2DYt−1 + (1 − E)Yt−2 = Xt − 2DXt−1 + Xt−2 .
Finally, band pass and band reject filters were constructed under the assumption B ≠ π/4, which is equivalent to E ≠ 1. However, both formulas (2.13) and
(2.14) work in that case as well (though α(z) becomes a first order polynomial).
Example. Suppose the center of the band is ωc = 5π/12 and the bandwidth is π/5, that is, B = π/10. Then D = cos(5π/12) sec(π/10) ≈ 0.272 and E = tan(π/10) ≈ 0.325. For the band pass filter, (2.13) becomes
the band pass filter, (2.13) becomes
1.325Yt − 0.544Yt−1 + 0.675Yt−2 = 0.325(Xt − Xt−2 ).
For the band reject filter, (2.14) becomes
1.325Yt − 0.544Yt−1 + 0.675Yt−2 = Xt − 0.544Xt−1 + Xt−2 .
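For convenience, here is a sketch that packages (2.13) and (2.14) into small helper functions (the names are invented); it reproduces the coefficients of the example above.

```python
import numpy as np

def tangent_bandpass_order2(center, bandwidth):
    """Coefficients of the 2nd order band pass tangent filter (2.13):
    (E+1) Y_t - 2D Y_{t-1} + (1-E) Y_{t-2} = E X_t - E X_{t-2}."""
    B = bandwidth / 2
    D = np.cos(center) / np.cos(B)
    E = np.tan(B)
    return np.array([E + 1, -2 * D, 1 - E]), np.array([E, 0.0, -E])

def tangent_bandreject_order2(center, bandwidth):
    """Coefficients of the 2nd order band reject tangent filter (2.14)."""
    B = bandwidth / 2
    D = np.cos(center) / np.cos(B)
    E = np.tan(B)
    return np.array([E + 1, -2 * D, 1 - E]), np.array([1.0, -2 * D, 1.0])

print(tangent_bandpass_order2(5 * np.pi / 12, np.pi / 5))
# (~[1.325, -0.544, 0.675], ~[0.325, 0, -0.325]), as in the example above
```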
Remark. Sometimes, the parametrization by the center of the band and the
bandwidth is not convenient. For instance, we may want to design, say, a band
reject filter with given bandwidth in such a way that T (ω0 ) = 0 for some specific
frequency ω0 , so that this frequency will be completely eliminated. If this is the
case, the answer could be found from the same equations (2.13) or (2.14) if we set
D = cos(ω0 ), E = tan(B).
Example. The data shown on the Figure 13 contains a strong periodic com-
ponent with frequency ω0 ≈ 1.69163. However, we suspect that the data may
also contain some other signals. Using a low pass or high pass filter will not solve
the problem; the oscillation is so strong that it will get through unless T(ω0) is
practically zero. So, we would like to create a band reject filter such that its trans-
fer function vanishes at ω0 . The parameter B is at our disposal, we have chosen
B = π/10. We get
D = cos(ω0 ) = −0.120537, E = tan(B) = 0.32492
and the formula (2.14) becomes
(1 + 0.181953B + 0.509525B^2)Yt = (0.754763 + 0.181953B + 0.754763B^2)Xt
The results of filtration could be seen on Figure 14. Indeed, the data contains a
dash-dot signal. Just for comparison, Figure 15 shows what happens if we apply a
low pass filter with cut-off point π/20 instead.

Exercises
1. Compute the coefficients of a 2nd order low pass tangent filter with the
cut-off point ω0 = π/30, and apply it to the data (will be posted on the web).

Figure 13. Data with strong periodic component.


Figure 14. Band reject filter in action.


Figure 15. 2nd order low pass filter with cut-off point π/20, ap-
plied to the same data set.

2. Compute the coefficients of a 2nd order high pass tangent filter with the
cut-off point ω0 = 19π/20, and apply it to the data (will be posted on the web).
3. Compute the coefficients of a 2nd order band pass tangent filter with cut-off
points 5π/6 ± π/20 and apply it to the data (will be posted on the web).
4. Compute the coefficients of a 2nd order band reject tangent filter with cut-off
points π/3 ± π/10 and apply it to the data (will be posted on the web).

5. Verify that the transfer function of the filter (2.5) is equal to

T(ω) = 1 / (1 + (cot(ω/2)/C)^2).

6*. Let T (ω) be given by the formula (2.7). Show that T (ω) → 1 if |ω−ωc | < B
and T (ω) → 0 if |ω − ωc | > B as n → ∞.
7*. Following the procedure described above, construct a 3rd order low pass
tangent filter with cut-off point ω0 .
8*. The data (will be posted on the web) contains a signal, some noise and
two strong periodic components, one with frequency π/7 and the other one with
frequency π/8. In order to decode the signal, design a second order band-reject
tangent filter (use B = π/15) that would completely eliminate the frequency π/8
and apply it to the data. Next, design a band reject filter (again, use B = π/15)
that would completely eliminate the frequency π/7 and apply it to the result of the
first filtration. Finally, apply the second order low pass tangent filter with cut-off
point π/15 to the result of the second filtration. Graph the results.

Project
A data set contains dash-dot signals hidden in the noise. Decode the signals!
(data set will be posted on the web).

3. Filters and Phase shift


Suppose
(3.1) Yt = β(B)Xt = b0 Xt + b1 Xt−1 + . . .
and assume that Xt = cos(ωt). We’d like to compute an action of the filter (3.1)
on the harmonics Xt . As we can see from (3.1), Yt is a linear combination of
delayed sine functions. In order to simplify that, we have to use a lot of trig
identities. However, there is a much easier way around. We assume that Xt =
eiωt = cos(ωt) + i sin(ωt), compute β(B)Xt and take the real part of the result.
Since
BXt = Xt−1 = eiω(t−1) = e−iω eiωt
and, for a general n,
B n Xt = e−inω eiωt = (e−iω )n eiωt
we have
β(B)Xt = β(e−iω )eiωt
Now, β(e−iω ) is a complex number and it can be represented as
G(ω) = β(e−iω ) = γ(ω)e−iϕ(ω)
where γ(ω) = |β(e−iω )| is a nonnegative real number and ϕ(ω) is defined up to the
multiple of 2π. The function G(ω) is called the gain function of the filter. As we
can see from (1.4), T (ω) = |G(ω)|2 = γ 2 (ω) is the transfer function of the filter.
The function ϕ(ω) is called a phase. In terms of the phase,
β(B)Xt = γ(ω)e^{−iϕ(ω)} e^{iωt} = γ(ω)e^{i(ωt−ϕ(ω))} = γ(ω)(cos(ωt − ϕ(ω)) + i sin(ωt − ϕ(ω)))

Figure 16. Low pass tangent filter of the order 2, applied in for-
ward and backward direction.

Taking the real part, we get


Yt = γ(ω) cos(ωt − ϕ(ω)).
For a recursive filter
α(B)Yt = β(B)Xt
the gain function is equal to
G(ω) = β(e^{−iω}) / α(e^{−iω}),
but all the above is applicable.
Discussion. We experience a phase shift (in fact, it is a delay). The value of
the shift depends on the frequency. The higher is the order of the filter, the more
significant is the delay. This effect can be clearly seen on Figures 9 - 11 showing
the result of the filtration with tangent filters of different orders.
There exist two possible settings. 1. Real time filtering. The data arrive one
at a time, we can’t use future values of Xt in the filter, the goal of the filtering
is to decode a signal, and we need it ASAP. If this is the case, then we have an
unpleasant choice between the quality of filtration and the size of the delay which
is unavoidable.
2. Retrospective analysis. The whole data set is available at once, we need to
find out what was the signal. In fact, we are looking for a decomposition
Xt = S̃t + Ñt
where S̃t is an estimated signal. In this setting, any phase shift is unacceptable.
There exist many possible ways to eliminate the phase shift. One solution, an
approximate one, practically eliminates the phase shift (errors are due to the end
effects). According to this method, we apply the same filter twice, once in forward
and again in backward direction. So, we choose a desired filter, say, a low pass
tangent filter, and apply it to the data Xt in order to obtain a filtered series Yt .
After that, we set Xt∗ = Yn+1−t (time reversal) and apply the same filter to Xt∗ .
Denote by Yt∗ the result of the second filtration. Finally, we do a time reversal once

again in order to get the output Zt = Y*_{n+1−t}. On the graph, you can see what
happens to the signal plus noise data from previous section if we apply the second
order tangent filter in forward and backward direction.
The resulting transfer function is equal to the square T^2(ω) of the transfer function of the original filter. In particular, its value at the cut-off point ω0 is equal to 1/4.
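A sketch of the forward-backward scheme, assuming the SciPy library is available: scipy.signal.lfilter runs the recursion α(B)Yt = β(B)Xt forward in time, and scipy.signal.filtfilt performs exactly the forward-backward pass.

```python
import numpy as np
from scipy import signal

# 2nd order low pass tangent filter (2.3), cut-off pi/30, as in the earlier sketch
A = np.tan(np.pi / 60)
alpha = np.array([1 + np.sqrt(2)*A + A**2, -2*(1 - A**2), 1 - np.sqrt(2)*A + A**2])
beta = (A**2) * np.array([1.0, 2.0, 1.0])

rng = np.random.default_rng(2)
x = rng.standard_normal(500)
x[100:200] += 1.0                                  # a toy "signal plus noise" series

y_forward = signal.lfilter(beta, alpha, x)         # one pass: phase shift (delay) is present
y_zero_phase = signal.filtfilt(beta, alpha, x)     # forward + backward: zero phase, transfer T(w)^2
```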

Figure 17. Transfer functions of Potter low pass filters with M =


10, 25, 50 and 75, cut-off point is π/10.


Figure 18. Transfer functions of Potter low pass filters with M =


10, 25, 50 and 75, cut-off point is ω0 = π/20. The filter with M =
10 no longer works, it actually amplifies harmonics with frequencies
above ω0 instead of suppressing them


Figure 19. Transfer functions of Potter low pass filters with M =


10, 25, 50 and 75, cut-off point is π/30. Both filters with M = 10
and M = 25 no longer work.

Another solution (an “exact” one) is to use so-called FIR (finite impulse re-
sponse) filters. We are looking for a filter of the form
(3.2)    Yt = Σ_{k=−M}^{M} b_k X_{t−k}

Figure 20. Transfer functions of Potter low pass filters with M =


10, 25, 50 and 75, cut-off point is π/60. The only filter that is still
working, corresponds to M = 75


Figure 21. Potter low pass filter with M = 75 and cut-off point
π/30 applied to the data shown on Figure 8. Compare it with
other (tangent) filters, especially with the one shown on Figure 16

where bk = b−k. The corresponding gain function

G(ω) = β(e^{−iω}) = Σ_{k=−M}^{M} b_k e^{−iωk} = b_0 + Σ_{k=1}^{M} b_k (e^{−iωk} + e^{iωk}) = b_0 + 2 Σ_{k=1}^{M} b_k cos(ωk)

is real valued, so the phase is equal to zero.


How do we get a low pass filter with a given cut-off point ω0? One possible solution, the so-called Potter 310 filter, has the following structure. We begin with the weights

(3.3)    b_k = sin(2πkω0) / (πk)
(obtained from the Fourier transform of the transfer function of the ideal filter).
Unfortunately, the weights bk decay at a rather slow rate. However, the formula
(3.2) requires M points to the right and to the left from t; the bigger is M , the
more significant are the end effects. In order to reduce the interval and still have
a reasonable quality of filtration, we multiply the weights bk by a lag window wk .

Finally,

(3.4)    Yt = Σ_{k=−M}^{M} w_k b_k X_{t−k}

where b_k are given by (3.3) and the window w_k is given by

w_k = (c_k / w) [ d_0 + 2 Σ_{p=1}^{3} d_p cos(πpk/M) ]

where

c_k = 1/2 if k = ±M, and c_k = 1 otherwise,

and

d_0 = 1, d_1 = 0.684988, d_2 = 0.202701, d_3 = 0.0177127, w = 2.8108034
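A sketch of the Potter 310 weights in Python. It assumes that the frequency in (3.3) is measured in cycles per observation, so that for a cut-off ω0 given in radians the weights become b_k = sin(kω0)/(πk); the function name is invented.

```python
import numpy as np

def potter310_lowpass_weights(omega0, M):
    """FIR weights w_k b_k, k = -M..M, of the Potter 310 low pass filter (3.4).
    omega0 is the cut-off in radians, so b_k = sin(k omega0)/(pi k), b_0 = omega0/pi."""
    d = [1.0, 0.684988, 0.202701, 0.0177127]
    wnorm = 2.8108034
    k = np.arange(-M, M + 1)
    b = np.where(k == 0, omega0 / np.pi,
                 np.sin(k * omega0) / (np.pi * np.where(k == 0, 1, k)))
    c = np.where(np.abs(k) == M, 0.5, 1.0)
    window = (c / wnorm) * (d[0] + 2 * sum(d[p] * np.cos(np.pi * p * k / M) for p in range(1, 4)))
    return b * window

w = potter310_lowpass_weights(np.pi / 30, M=75)
# Y_t = sum_k w[k + 75] * X_{t-k}; the weights are symmetric, so the phase is zero
```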
Drawbacks: This looks good on paper but, in practice, M should be large in
order to achieve high quality filtration or to handle narrow bands. If you fix M
and make a band way too narrow, the filter breaks down - its transfer function no
longer approximates the transfer function for the ideal filter. You can see this on
Figures 17, 18, 19 and 20. Also, on Figure 21 you can see a result of application of
Potter filter with M = 75 and cut-off point π/30 to the data shown on the Figure
8.
More drawbacks: If M is adequate, then the transfer function of the low pass filter is indeed a reasonable approximation to the transfer function of the ideal filter. However, it is not as good for other types of filters.
Discrete Fourier transform and filtering. It looks like we can achieve
desired results if we compute the discrete Fourier transform of the original data,
multiply the result by desired transfer function and apply the inverse Fourier trans-
form in order to get the “filtered” data. Indeed, if we compare the periodogram
of the “filtered” data with that of the original data, we may decide that we’ve
managed to construct an ideal filter. Unfortunately, it is not correct.
Suppose for example, that the data contains a periodic component A cos(ωt)
with frequency ω. If ω coincides with one of the principal frequencies, then it
will be either preserved or completely eliminated, depending on the value of the
transfer function. However, if ω is not one of the principal frequencies (and why
should it be?), then it will be represented as a linear combination of oscillations
that correspond to the principal frequencies. Only after that, the transfer function
will be applied and some of those components will be dropped out of the series.
To illustrate what may happen, we apply this procedure to the data shown on the
Figure 13. Namely, we are trying to imitate a band reject filter with cut-off points
1.69 ± π/10. The results are shown on Figure 22. And, Figures 23 and 24 show the
periodogram of the original data and of the “filtered” one.

Figure 22. Fourier transform-based band reject “filter” with cut-


off points 1.69 ± π/10 applied to the data shown on Figure 13.
Compare it with Figure 14.

Figure 23. Periodogram of the original data from Figure 13


20 000

15 000

10 000

5000

0.0 0.5 1.0 1.5 2.0 2.5 3.0

Figure 24. Periodogram of the “filtered” data from Figure 22. It


looks like the “ideal” band reject filter has been applied, doesn’t it?
CHAPTER 6

Seasonality

1. Models and Methods


Seasonal effects have a period that is known to us in advance. For instance, they
may show up if we have monthly or quarterly data, with period 12 or 4, respectively.
We have seen some examples of such data in the introduction (Figures 0.2 and 0.3).
More examples could be found below.
For the sake of brevity, let us assume that we have a monthly data, so the
corresponding period of seasonality is 12.
In most cases, seasonal effects are clearly visible with our naked eye. It could
be, however, that presence of seasonal effects is not so obvious, like on Figure 1.
What could we do if this is the case? To begin with, a series like that is, most
likely, non-stationary. Also, it might be that the magnitude of random/seasonal
oscillations is proportional to the achieved level. To reduce those effects, we should
either switch to the increments of the series or to the increments of its logarithm.
When this is done, we can either compute the ACF of the transformed data, or look
at the periodogram instead (which is better). If seasonal effects are present, the values
of ACF that correspond to one year delay, two years delay etc., should be large.
However, it depends on how well we managed to eliminate trend effects. In turn,
the periodogram should contain sharp peaks at frequencies proportional to 2π/12,
that is at frequencies that correspond to one year period, six months period, four
months period etc. You can see those effects on Figures 3 and 2. Periodogram is,
in fact, more sensitive — even if we compute the periodogram for the original data,
those peaks can be seen (Figure 4).
To simplify further notation, let t = 1 correspond to January. There exist three
principal ways to handle seasonal series. We briefly discuss them.
1. Parametric approach. We could use, for example, a model

(1.1) Xt = a + bt + St + εt

where Tt = a + bt stands for the trend (we could use another model for the trend,
say parabolic or exponential or any other), and St stands for seasonal component,
so we assume that it is periodic with period 12. They usually call St the seasonal
index (Sj can be interpreted as an adjustment for the month j). We normally
assume that

(1.2) S1 + · · · + S12 = 0

(we always can achieve that by modifying the parameter a). We treat a, b and the
values S1 , . . . , S11 as unknown parameters, and we could use least squares in order
to estimate them.

Figure 1. Car registration data, since January 1960, 22 years.


This is a series with seasonal effects though we can’t see them
with our naked eye


Figure 2. Periodogram of the increments of the logarithm of the


car registration data. Peaks at the frequency π/6 and its multiples
are clearly visible


Figure 3. Sample ACF of the increments of the logarithm of the


car registration data. Note the values that correspond to lags 12,
24 etc.

In fact, this particular model fits exactly into the multiple linear regression scheme. We define extra variables X_t^{(1)}, . . . , X_t^{(12)} as follows: X_t^{(1)} = 1 if t corresponds to January, otherwise X_t^{(1)} = 0, and so on. Then the model (1.1) can be

Figure 4. Periodogram of the original data. Since the series is


not stationary, it contains a tall peak at zero. However, we can see
smaller peaks that correspond to seasonal effects

rewritten as

X_t = bt + Σ_{j=1}^{12} S_j X_t^{(j)} + ε_t

After applying the least squares, we can set a = (S1 + · · · + S12 )/12 and subtract
a from every Sj , in order to satisfy (1.2).
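This least squares fit with monthly dummy variables can be written out as a short sketch (the function name is invented; it implements exactly the recipe just described).

```python
import numpy as np

def fit_trend_plus_seasonality(x):
    """Least squares fit of (1.1), X_t = a + b t + S_t + eps_t, with monthly
    dummy variables and the constraint (1.2): S_1 + ... + S_12 = 0."""
    n = len(x)
    t = np.arange(1, n + 1)
    month = (t - 1) % 12                    # t = 1 corresponds to January
    D = np.zeros((n, 12))
    D[np.arange(n), month] = 1.0            # the dummy variables X_t^{(j)}
    A = np.column_stack([t, D])             # model X_t = b t + sum_j S_j X_t^{(j)}
    coef, *_ = np.linalg.lstsq(A, np.asarray(x, dtype=float), rcond=None)
    b, S = coef[0], coef[1:]
    a = S.mean()                            # move the common level into a ...
    return a, b, S - a                      # ... so that the S_j sum to zero
```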
If the magnitude of the seasonal effect clearly looks proportional to the data
(more common), we have a choice between the preliminary logarithmic transforma-
tion and additive model (1.1) or a multiplicative model given by the equation
(1.3) Xt = (a + bt)St + εt
where Tt = a + bt is the (linear) trend, and St is the multiplicative seasonal index.
In this case, Sj should be positive, and the condition (1.2) makes no sense. It is
commonly replaced either by the condition

(S1 + · · · + S12)/12 = 1

or by the condition

S1 · S2 · · · S12 = 1
However, estimation of the parameters of the multiplicative model is no longer a
linear regression.
2. Seasonal Exponential Smoothing. The second possibility is to adopt the
ideas of the exponential smoothing (so called Holt-Winters model ). It is designed
to handle the case when the coefficients of the previous model (trend coefficients
and seasonal indices) fluctuate in time. We assume that the series can be locally
approximated by the multiplicative model (1.3). Each time, we compare the one
step ahead forecast with the actual data, and adjust the coefficients accordingly.
Ordinarily, (non-seasonal) exponential smoothing depends on the smoothing
parameter α which determines the sensitivity of the model (how strongly does the
model react to fluctuations). In order to make the seasonal version of the method
more flexible, they have introduced three smoothing parameters α, γ, δ, one of them
controls the adjustment of the current level a, the second controls the slope b and
the last controls the adjustment of the seasonal index S. To be precise, we assume
208

that the k steps ahead forecast, constructed at time n, is given by the formula
X̂n+k (n) = (an + bn k)Sn−12+k
where at , bt , St are the coefficients of the model used at time t. The adjustment
equations are as follows:

a_n = α (X_n / S_{n−12}) + (1 − α)(a_{n−1} + b_{n−1})
b_n = γ(a_n − a_{n−1}) + (1 − γ)b_{n−1}
S_n = δ (X_n / a_n) + (1 − δ)S_{n−12}
As we can see, if all the smoothing constants are (practically) zeroes, then no
adjustment occurs (except the obvious an = an−1 + bn−1 due to the change of the
origin from n − 1 to n).
An additive version of this model is called the Theil-Wage model. This time, we
assume that the data can be locally approximated by additive model (1.1). Hence,
the k steps ahead forecast, constructed at time n, equals
X̂n+k (n) = an + bn k + Sn−12+k .
As above, we have three smoothing parameters α, γ, δ, one of them controls the
adjustment of the current level a, the second controls the slope b and the last
controls the adjustment of the seasonal index S. The corresponding adjustment
equations are
an = α(Xn − Sn−12 ) + (1 − α)(an−1 + bn−1 )
bn = γ(an − an−1 ) + (1 − γ)bn−1
Sn = δ(Xn − an ) + (1 − δ)Sn−12 .
Practical recommendation: Try to minimize average one step ahead prediction
error as a function of the smoothing parameters. If any of the optimal smoothing
parameters (especially α) turns out to be greater than .25, the model is not working
properly.
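A minimal sketch of the multiplicative (Holt-Winters) recursions above; the choice of the initial values a0, b0, S0 is a separate question that is not addressed here, and the function name is invented.

```python
import numpy as np

def holt_winters_multiplicative(x, alpha, gamma, delta, a0, b0, S0):
    """Seasonal exponential smoothing (Holt-Winters, period 12) using the
    adjustment equations above. a0, b0: initial level and slope; S0: twelve
    initial seasonal indices. Returns one step ahead forecasts and the
    final coefficients."""
    a, b = a0, b0
    S = list(S0)                          # S[-12] plays the role of S_{n-12}
    forecasts = []
    for xn in x:
        forecasts.append((a + b) * S[-12])                 # forecast of X_n made at n-1
        a_new = alpha * xn / S[-12] + (1 - alpha) * (a + b)
        b = gamma * (a_new - a) + (1 - gamma) * b
        S.append(delta * xn / a_new + (1 - delta) * S[-12])
        a = a_new
    return np.array(forecasts), (a, b, np.array(S[-12:]))
```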
3. A seasonal version of the Box-Jenkins (ARIMA) model. Suppose a
series Xt has the structure (1.1). Consider a new series
Yt = Xt − Xt−12 = (1 − B 12 )Xt
(Recall that the operator ∇12 = 1 − B 12 is called the seasonal difference). As we
can see,
Yt = 12b + εt − εt−12
is a stationary series. Moreover, it is a moving average process of the order 12
(though it is not invertible). If the trend is not linear, we may have to take another
difference and consider
Zt = ∇∇12 Xt
In our case, Zt = εt − εt−1 − εt−12 + εt−13 . However, it is stationary (and even with
zero expectation). Based on this observation, the following procedure has been
suggested.
a. Make sure the seasonal component is not multiplicative (take logarithm if
necessary).
209

Sales
12 000

10 000

8000

6000

4000

2000

Months
0 20 40 60 80 100 120 140

Figure 5. Sales (Dry cleaning)

b. Take seasonal difference ∇12 or maybe seasonal difference and ordinary


difference ∇∇12 (denote the result by Zt). Expect Zt to be stationary (if it is not,
the method is not working).
c. However, the series Zt may show significant correlation for small lags (a few
months delay) as well as lags proportional to 12 (one year, two years, so on). In fact,
we could expect such things in the original series as well. Taking the differences
only makes things worse.
For this reason, we look for the model of the following structure:
α(B)αs (B 12 )Zt = β(B)βs (B 12 )εt
where α, αs , β and βs are polynomials. Without αs and βs , it is just an ARMA
model. However, if we limit ourselves by plain ARMA models, we may have to
consider really big orders to accommodate one year dependencies. In order to
reduce the number of parameters, two more polynomials have been added. They
are called seasonal autoregression and seasonal moving average.
In most cases, it is enough to consider linear or quadratic polynomials.
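In practice one usually relies on existing software. A hedged sketch, assuming the statsmodels package is available (its SARIMAX class fits seasonal ARIMA models of this form); the function name and the variable "data" are illustrative only.

```python
import numpy as np
import statsmodels.api as sm

def fit_seasonal_arima(x, order=(1, 1, 0), seasonal_order=(1, 1, 1, 12), log=True):
    """Fit alpha(B) alpha_s(B^12) to the differenced series = beta(B) beta_s(B^12) eps_t.
    SARIMAX applies the ordinary difference d times and the seasonal difference D times itself."""
    y = np.log(x) if log else np.asarray(x, dtype=float)
    model = sm.tsa.SARIMAX(y, order=order, seasonal_order=seasonal_order)
    return model.fit(disp=False)

# res = fit_seasonal_arima(data)        # 'data' is a monthly series
# print(res.summary()); print(res.forecast(steps=12))
```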

2. Examples
In all the examples below, we set aside twelve last points (one year). We con-
struct a prediction using one of the methods described above, and compare it with
actual data.
1. As a first example, consider monthly sales of the dry cleaning (Figure
5). This is a monthly data, 12 years, 144 points. The seasonal effects definitely
look proportional to the current level, so we could try a linear trend model with
multiplicative seasonality. Our second option is to take a logarithm and then try
a linear or quadratic trend with additive seasonality. On Figures 6 and 7, you can
see a prediction made at the end of eleventh year, compared with the actual data.
As we can see from the graphs, the multiplicative model is not working properly (one third of the points fall outside the confidence region).
We could also use a seasonal version of the exponential smoothing. For the
multiplicative model (Holt-Winters), optimization of the smoothing constants gives
α = 0.14, γ = 0.067 and δ = 0.415. Second option is to take logarithm and apply an
additive version of the model (Theil-Wage). Optimization of the parameters gives
us α = 0.232, γ = 0.085 and δ = 0.506 In both cases, the parameter δ (the one

Figure 6. Linear trend and multiplicative seasonality model fit-


ted to the first eleven years of the data. You can see here a predic-
tion together with 95 % confidence bounds, and compare it with
the actual data. Apparently this model is not working properly -
a number of points is outside of 95 % confidence region.


Figure 7. Quadratic trend and additive seasonality fitted to log


of the data (again, to the first eleven years of them). Looks much
better.


Figure 8. Seasonal Exponential Smoothing (multiplicative model)

responsible for adjustment of the seasonal coefficients) looks a bit large. However,
as you can see on the graphs (Figures 8 and 9), prediction looks very good.
Finally, we could take a logarithm and apply a seasonal version of the Box-
Jenkins (ARIMA) model. We take simple and seasonal differences in order to make

Figure 9. Seasonal Exponential Smoothing (additive model ap-


plied to the logarithm of the data)


Figure 10. Seasonal ARIMA(1, 1, 0) × (1, 1, 1)12 model

the series stationary. Models of different structure look good here, in particular,
the following one
(1 + 0.429B)(1 − 0.29B^{12})∇∇12 Xt = (1 − 0.9941B^{12})εt

(see Figure 10). However, the estimated value of the seasonal moving average parameter
is very close to −1 which makes the model very close to a non-invertible one.
Because of that, trend models look more reasonable.
2. Another data set (Figure 11), also 12 years, 144 points, represents a num-
ber of air passengers. Seasonal effects are not so stable here, since some holidays
move around (for instance, Easter). Once again, we may begin with trend models
(multiplicative looks more reasonable, see Figure 12). Exponential smoothing does
not seem to work here, because optimal value of one of the smoothing parameters
turns out to be 0.95. ARIMA models seem very good. Again, various models could
be tried, in particular, this one:
(1 + 0.06B^{12})Xt = (1 − 0.35B)(1 − 0.59B^{12})εt
(see Figure 13). Here, p-values for all estimated parameters are practically zeroes
(so all of them are significant), and the p-value for the Portmanteau test for the
residuals is 0.8163 (so it is a white noise).
3. Our next example is women unemployment data (in UK? 67 points, 5.5
years, since January 1967, Figure 14). Seasonal effects look proportional to the
level here, so we can either try a trend model with multiplicative seasonality, or

Figure 11. Air passengers data


Figure 12. Air passengers data. Trend with multiplicative seasonality.


Figure 13. Air passengers data. Seasonal ARIMA(0, 1, 1) ×


(1, 1, 1)12 model

we could do multiplicative version of exponential smoothing, or we could take the


logarithm and use a seasonal version of ARIMA model. However, it is not obvious
how to parameterize the trend (for sure, it is not linear. There is absolutely no
reason for the trend to be quadratic. And, if we try quadratic trend nonetheless,
we get a prediction that does not agree with the data - Figure 15). Neither of
the exponential smoothing models is working as well (if we optimize smoothing
constants, we get α = 0.88, δ = 1 for Holt-Winters model and α = 0.669, δ = 1 for
Theil-Wage model). However, ARIMA models are quite adequate here, for instance

Figure 14. Women unemployment, since January 1967, 5.5 years.


Figure 15. Women unemployment data, multiplicative trend


model Xt = (1.06 + 0.02t + 0.003t2 )St + εt . The prediction looks
biased, apparently the model is not working


Figure 16. Women unemployment data, seasonal


ARIMA(1, 1, 0) × (0, 1, 0)12 model

this one:
(1 − 0.3487B)∇∇12 Xt = εt
4. Temperature data (since January 1960, 20 years). As we can see, there is
no trend. Hence, there is no reason to consider logarithms or multiplicative trend
model. However, we still may consider a model
Xt = a + St + εt
214

Data
8
6
4
Months
0 50 100 150 200

Figure 17. Temperature since January 1960, 20 years.


Figure 18. Temperature data - trend type model Xt = 4.9022 +


St + εt


Figure 19. Temperature, seasonal ARIMA(1, 0, 0) × (1, 1, 1)12 model

where a is a constant and St stands for seasonal component (see Figure 18). Equally,
we may consider an ARIMA type model, the following one looks alright:
(1 − 0.302B)(1 + 0.292B^{12})∇12 Xt = (1 − 0.871B^{12})εt

(see Figure 19). Note that, since there is no trend at all, we only take the seasonal difference. Exponential smoothing (Theil-Wage) works here as well; the corresponding smoothing parameters are α = 0.137, γ = 0.065 and δ = 0.214. However, the series does not contain a trend, so the model should be modified: there should be no γ at all.
5. Residential electricity (since January 1971, 9 years). The data is shown on
Figure 21. We can see some trend here, though it is probably not linear (changes
in population?). Also, we expect seasonal effects to be multiplicative, not addi-
tive (proportional to population size?). Estimated model is shown on Figure 22.
Next, we can try an ARIMA type model. Since we expect seasonal effects to be

Figure 20. Temperature data, exponential smoothing (modifica-


tion of Theil - Wage)


Figure 21. Residential electricity since January 1971, 9 years.


Figure 22. Electricity data - trend model Xt = (4.5 + 0.0106t −


0.00055t2 )St + εt

multiplicative, we take the logarithm first. The following model looks good enough:
(1 − 0.328B)(1 + 0.624B^{12})∇12 log Xt = εt
(see Figure 23). Exponential smoothing (Holt - Winters) works somewhat (Figure
24), optimal smoothing parameters are α = 0.18, γ = 0.035 and δ = 0.552. And,
the parameter δ (responsible for the adjustment of seasonal coefficients) is way too
big.
6. Our last example is the car registration data briefly discussed above (22
years, since January 1947, see Figure 1). The series definitely contains a trend.
However, if we look at the graph, we realize that no parametric model could possibly
work here. Our only hope is a seasonal ARIMA model, and indeed, the following
one looks alright:
(1 + 0.216B + 0.29B^2)(1 + 0.593B^12 + 0.402B^24 + 0.254B^36)∇∇12 Xt = εt

Figure 23. Electricity data, seasonal ARIMA(1, 0, 0) × (1, 1, 1)12 model


Figure 24. Electricity data, exponential smoothing (Holt - Winters)


Figure 25. Car registration data, seasonal ARIMA(2, 1, 0) × (3, 1, 0)12 model

(see Figure 25). Exponential smoothing is another option (Figure 26); however, the corresponding values of the smoothing parameters are α = 0.54, γ = 0.006, δ = 0.5, and two of the parameters are too big.
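Models of this kind can be fitted with standard statistical software. As a rough illustration of the workflow only (not the computation behind the figures above), here is a minimal sketch in Python, assuming the numpy and statsmodels packages are available; the array x is a synthetic placeholder, not the car registration data.

    # A minimal sketch of fitting a seasonal ARIMA(2,1,0) x (3,1,0)_12 model,
    # assuming statsmodels is installed. The array x is a placeholder series,
    # not the actual data discussed in this section.
    import numpy as np
    from statsmodels.tsa.statespace.sarimax import SARIMAX

    rng = np.random.default_rng(0)
    x = rng.normal(size=264).cumsum() + 10          # 22 years of monthly "data"

    model = SARIMAX(x, order=(2, 1, 0), seasonal_order=(3, 1, 0, 12))
    result = model.fit(disp=False)                  # maximum likelihood estimation
    print(result.params)                            # AR and seasonal AR coefficients
    print(result.forecast(steps=12))                # one year ahead

Changing order and seasonal_order gives the other models mentioned above, e.g. (1, 0, 0) × (1, 1, 1)12 for the temperature data.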

Exercises
1. Let Xt = (a + bt)St + εt, where St = St+12 for all t. Does the transformation ∇12 Xt = Xt − Xt−12 make the series stationary? If not, find one that does.

Figure 26. Car registration data, exponential smoothing (Theil - Wage)


CHAPTER 7

Multivariate models

1. Cross-covariance, cross-correlation and cross-spectrum


Let Xt and Yt be two stationary series. The cross-covariance function (CCV)
of X and Y is defined by the formula
CXY (k) = Cov(Xt , Yt+k ) = EXt Yt+k − EXt EYt+k
As we can see, the autocovariance function RX (k) = CXX (k) coincides with the
cross-covariance function of X with itself. Also, CXY (k) = CY X (−k) (equivalent
of the property RX (k) = RX (−k)).
Dividing the cross-covariance function by the product of the standard devia-
tions of X and Y , we get a cross-correlation function (CCF)
ρXY(k) = Corr(Xt, Yt+k) = CXY(k) / (σX σY)
As for CCV, ρXY (k) = ρY X (−k). Also,
|ρXY (k)| ≤ 1
(Schwartz inequality). Cross-correlation and cross-covariance functions represent
relations between two series. In particular, if X and Y are independent, CXY (k) = 0
for all k. On the other hand, if ρXY (l) = ±1 for some l, then there exist such
coefficients a, b that Yt+l = aXt + b for all t.
Examples. 1. Let Xt be a stationary process with autocovariance function RX(k), and let Yt = Xt + ηt, where ηt is a white noise independent from Xt. For instance, such a situation appears if we measure Xt with an error. Then
σY^2 = σX^2 + ση^2
and
CXY(k) = Cov(Xt, Yt+k) = Cov(Xt, Xt+k + ηt+k) = RX(k)
for all k. Therefore the cross-correlation function is
ρXY(k) = RX(k) / ( σX √(σX^2 + ση^2) )

2. Suppose εt and ηt are two independent white noises with expectation zero
and variances σε2 and ση2 . Let Xt and Yt be given by the formulas
Xt = εt + bεt−1 + βηt−1
Yt = ηt + cηt−1 + γεt−1

Then
Var(Xt) = (1 + b^2)σε^2 + β^2 ση^2,
Var(Yt) = (1 + c^2)ση^2 + γ^2 σε^2,
and
RX(±1) = bσε^2,
RY(±1) = cση^2,
and RX(k) = RY(k) = 0 if |k| ≥ 2. Next,
CXY(0) = Cov(Xt, Yt) = bγσε^2 + cβση^2
CXY(1) = Cov(Xt, Yt+1) = γσε^2
CXY(−1) = Cov(Xt, Yt−1) = βση^2
CXY(k) = 0, if |k| ≥ 2
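These formulas are easy to check numerically. The sketch below (assuming numpy; the parameter values are arbitrary) simulates long realizations of Xt and Yt and compares the sample cross-covariances with the theoretical values.

    # Numerical check of the cross-covariances in Example 2 (a sketch).
    import numpy as np

    rng = np.random.default_rng(1)
    N = 200_000
    b, beta, c, gamma = 0.5, 0.3, -0.4, 0.7
    sig_eps, sig_eta = 1.0, 2.0                    # standard deviations of eps and eta

    eps = rng.normal(0, sig_eps, N + 1)
    eta = rng.normal(0, sig_eta, N + 1)
    X = eps[1:] + b * eps[:-1] + beta * eta[:-1]
    Y = eta[1:] + c * eta[:-1] + gamma * eps[:-1]

    def ccv(x, y, k):                              # sample Cov(X_t, Y_{t+k}), k >= 0
        return np.mean((x[:len(x) - k] - x.mean()) * (y[k:] - y.mean()))

    print(ccv(X, Y, 0), b * gamma * sig_eps**2 + c * beta * sig_eta**2)   # C_XY(0)
    print(ccv(X, Y, 1), gamma * sig_eps**2)                               # C_XY(1)
    print(ccv(Y, X, 1), beta * sig_eta**2)                                # C_XY(-1) = C_YX(1)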

Estimation of the cross-covariance function. Assume that we have two


time series X1 , . . . , XN and Y1 , . . . YN of the same length (and, of course, with the
same interval between observations). Similar to what was done for the autocovari-
ance function, we set
(1.1)   ĈXY(k) = (1/N) Σ_{t=1}^{N−k} (Xt − X̄)(Yt+k − Ȳ),
        ĈXY(−k) = (1/N) Σ_{t=k+1}^{N} (Xt − X̄)(Yt−k − Ȳ),

for k = 0, 1, . . . , N − 1. Here X̄ and Ȳ stand for the sample means of X and Y .


The estimate for the cross-correlation function can be obtained from here by the
formula

(1.2)   ρ̂XY(k) = ĈXY(k) / √( R̂X(0) R̂Y(0) )

Remark. It could be shown that ĈXY and ρ̂XY are consistent, asymptotically
normal and asymptotically unbiased.
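For reference, the estimates (1.1) and (1.2) translate directly into code; the following sketch (assuming numpy) is one way to implement them.

    # Direct implementation of the estimates (1.1)-(1.2); a sketch assuming numpy.
    import numpy as np

    def sample_ccv(x, y, k):
        """Estimate C_XY(k) = Cov(X_t, Y_{t+k}) by formula (1.1); k may be negative."""
        x = np.asarray(x, float)
        y = np.asarray(y, float)
        N = len(x)
        xc, yc = x - x.mean(), y - y.mean()
        if k >= 0:
            return np.sum(xc[:N - k] * yc[k:]) / N
        return np.sum(xc[-k:] * yc[:N + k]) / N

    def sample_ccf(x, y, k):
        """Estimate rho_XY(k) by formula (1.2)."""
        return sample_ccv(x, y, k) / np.sqrt(sample_ccv(x, x, 0) * sample_ccv(y, y, 0))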
Suppose both series have expectation zero (known to us). Then the estimate (1.1) takes the form
(1.3)   ĈXY(k) = (1/N) Σ_{t=1}^{N−k} Xt Yt+k,
        ĈXY(−k) = (1/N) Σ_{t=k+1}^{N} Xt Yt−k,

and its expectation is equal to E ĈXY(k) = ((N − |k|)/N) CXY(k).

Suppose now that the series X and Y are, in fact, independent. Then E ĈXY (k) =
0 and
Var ĈXY(k) = E(ĈXY(k))^2 = (1/N^2) Σ_{s=1}^{N−k} Σ_{t=1}^{N−k} E(Xt Yt+k Xs Ys+k)
= (1/N^2) Σ_{s=1}^{N−k} Σ_{t=1}^{N−k} RX(t − s) RY(t − s)
= (1/N^2) Σ_{u=−(N−k)+1}^{N−k−1} (N − k − |u|) RX(u) RY(u)

For small k,
Var(ĈXY(k)) ≈ (1/N^2) Σ_{u=−N+1}^{N−1} (N − |u|) RX(u) RY(u)
= (1/N) ( RX(0)RY(0) + 2 ((N − 1)/N) RX(1)RY(1) + · · · + (2/N) RX(N − 1)RY(N − 1) )
= (σX^2 σY^2 / N) ( 1 + 2 ((N − 1)/N) ρX(1)ρY(1) + · · · + (2/N) ρX(N − 1)ρY(N − 1) )
Finally,
(1.4)   Var ρ̂XY(k) ≈ (1/N) ( 1 + 2 ((N − 1)/N) ρX(1)ρY(1) + · · · + (2/N) ρX(N − 1)ρY(N − 1) )
Suppose now that one of the series is actually a white noise. Then Var ρ̂XY(k) ≈ 1/N and
P{ |ρ̂XY(k)| ≤ 2/√N } ≈ .95.
However, if both series have a non-trivial correlation function, then the confidence interval may be much bigger. The following procedure allows us to get around this problem.
Step 1. (Pre-whitening) Construct a transformation
X̃t = α(B)Xt
such that X̃t is (approximately) a white noise (for instance, fit an AR(l) model of
a reasonable order).
Step 2. If Xt and Yt are independent, then X̃t and Yt are still independent,
and we can use the confidence intervals constructed above.
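A sketch of this two-step procedure (assuming numpy and the sample_ccf helper from the previous sketch) might look as follows; the AR order p = 3 in the usage comment is an arbitrary choice.

    # Pre-whitening sketch: fit an AR(p) model to X by least squares and use the
    # residuals as the (approximately) white series X-tilde.
    import numpy as np

    def prewhiten_ar(x, p):
        x = np.asarray(x, float) - np.mean(x)
        # design matrix with rows (x_{t-1}, ..., x_{t-p}) for t = p, ..., len(x)-1
        rows = np.column_stack([x[p - j - 1:len(x) - j - 1] for j in range(p)])
        coef, *_ = np.linalg.lstsq(rows, x[p:], rcond=None)
        resid = x[p:] - rows @ coef
        return coef, resid

    # Usage (y must be aligned with the residuals, which start at index p):
    # coef, x_tilde = prewhiten_ar(x, p=3)
    # bound = 2 / np.sqrt(len(x_tilde))
    # for k in range(-10, 11):
    #     print(k, sample_ccf(x_tilde, y[3:], k), "+/-", bound)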
Cross-spectrum. Suppose CXY(k) → 0 as k → ±∞, so that the series Σ_{k=−∞}^{∞} |CXY(k)| converges. A cross-spectral density of X and Y is defined by the formula
fXY(ω) = (1/2π) Σ_{k=−∞}^{∞} CXY(k) e^{−iωk},   −π ≤ ω ≤ π
Since CXY (k) is not even, the cross-spectral density is no longer a real-valued
function. However, we still have
CXY(k) = ∫_{−π}^{π} e^{iωk} fXY(ω) dω

As a complex-valued function, fXY can be represented as
fXY(ω) = cXY(ω) − i qXY(ω)
where
cXY(ω) = (1/2π) Σ_{k=−∞}^{∞} CXY(k) cos(ωk)
is called the co-spectrum, and
qXY(ω) = (1/2π) Σ_{k=−∞}^{∞} CXY(k) sin(ωk)
is called the quadrature spectrum. The other possibility is to use the representation
fXY(ω) = αXY(ω) e^{iϕXY(ω)}
where αXY is called the cross-amplitude spectrum, and ϕXY is called the phase (no surprise). Graphs of these functions are really hard to interpret. However, we can compare, say, the cross-amplitude spectrum with the spectral densities of X and Y. This way we arrive at the following functions:
C(ω) = αXY(ω)^2 / ( fX(ω) fY(ω) ),
which is called the coherency function, and
GXY(ω) = αXY(ω) / fX(ω),
which is called the gain function.
It could be shown that 0 ≤ C(ω) ≤ 1. In a sense, C(ω) is a square of the
correlation coefficient at the frequency ω.
Examples. 1. Suppose Yt = Xt −bXt−1 (and both series have the expectation
zero). Then
CXY(k) = EXt Yt+k = EXt Xt+k − b EXt Xt+k−1
= RX(k) − b RX(k − 1) = (1 − bB) RX(k)
Hence
fXY(ω) = (1/2π) Σ_k CXY(k) e^{−iωk}
= (1/2π) Σ_k ( RX(k) − b RX(k − 1) ) e^{−iωk}
= (1/2π) Σ_k RX(k) e^{−iωk} − b e^{−iω} (1/2π) Σ_k RX(k − 1) e^{−iω(k−1)}
= fX(ω) − b e^{−iω} fX(ω) = (1 − b e^{−iω}) fX(ω)
In a similar way, if α(B)Yt = β(B)Xt, then
fXY(ω) = ( β(e^{−iω}) / α(e^{−iω}) ) fX(ω)
Therefore
αXY(ω) = | β(e^{−iω}) / α(e^{−iω}) | fX(ω)

and the coherency
C(ω) = | β(e^{−iω}) / α(e^{−iω}) |^2 fX(ω) / fY(ω) = 1
by (1.14). So, the series Yt does not contain any extra information which Xt does
not contain already.
2. Suppose now that Yt = Xt−l = B l Xt is a delayed series Xt . According to
Example 1,
fXY (ω) = e−iωl fX (ω)
and therefore the phase ϕ(ω) = −lω is a linear function of the frequency.
3. Suppose now Yt = Xt + ηt where ηt is a stationary series independent from
Xt . We can easily see that
CXY (k) = Cov(Xt , Xt+k ) + Cov(Xt , ηt+k ) = RX (k)
For this reason, fXY(ω) = fX(ω) is a real-valued function and the coherency is
C(ω) = fX(ω) / fY(ω)
However,
fY(ω) = fX(ω) + fη(ω)
and therefore
C(ω) = 1 / ( 1 + fη(ω)/fX(ω) )

Estimation of the cross-spectrum. All that was said about the estimation of the spectral density is applicable here as well. Once again, we either use a lag window,
f̂XY(ω) = (1/2π) Σ_{k=−N+1}^{N−1} λk ĈXY(k) e^{−iωk}
where
λk = λ(k/M),
or a spectral window:
f̂XY(ω) = ∫_{−π}^{π} IXY(λ) k(ω − λ) dλ
where
k(ω) = M K(Mω)
and
IXY(ω) = (1/(2πN)) ( Σ_t Xt e^{iωt} ) ( Σ_t Yt e^{−iωt} )
is the cross-periodogram.
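As a sketch of how such an estimate can be computed (assuming numpy and the sample_ccv helper defined earlier; the Bartlett window and the truncation point M are arbitrary illustrative choices):

    # Lag-window estimate of the cross-spectrum with a Bartlett (triangular) window.
    import numpy as np

    def cross_spectrum_estimate(x, y, omega, M):
        """Estimate f_XY(omega) with weights lambda_k = 1 - |k|/M, |k| < M."""
        ks = np.arange(-(M - 1), M)
        lam = 1.0 - np.abs(ks) / M
        ccv = np.array([sample_ccv(x, y, int(k)) for k in ks])
        return np.sum(lam * ccv * np.exp(-1j * omega * ks)) / (2 * np.pi)

    # Co-spectrum and quadrature spectrum on a grid of frequencies:
    # freqs = np.linspace(0, np.pi, 200)
    # fxy = np.array([cross_spectrum_estimate(x, y, w, M=30) for w in freqs])
    # c_xy, q_xy = fxy.real, -fxy.imag        # since f_XY = c_XY - i q_XY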
To illustrate the concept, we consider the electromagnetic activity of a human
brain.

Figure 1. Estimated cross-correlation function between the left and the right hemisphere of a human brain.


Figure 2. Power cross-spectrum for the left and the right hemi-
sphere. The highest of the sharp peaks is a resonance frequency
of the meter, the second one is the frequency of the electric power
generator.


Figure 3. Phase of the cross-spectrum for the left and the right
hemisphere. Kind of hard to interpret.

2. State-Space models. Kalman Filter


State-space model. This model looks rather exotic by itself. However, it has
a number of applications.
We assume that the state of the system is given by a (column) vector Xt of
dimension k. Evolution of the system is described by the following equation, called

Figure 4. Coherency function for the left and the right hemisphere.


Figure 5. Cross-correlation between different human brains (they should be independent, shouldn’t they?).


Figure 6. Coherency function, different human brains.

transitional, or system, equation:


(2.1) Xt = At Xt−1 + et + at
where At is a k × k matrix, at is a non-random column vector and the noise compo-
nents et (again, a column vector) are independent for different t, have expectation
zero and covariance matrix Cov et = Qt . The equation (2.1) defines a distribution
of Xt if we say something about X0 . We assume that the vector X0 has the expec-
tation EX0 = x0 and the covariance matrix Cov(X0 ) = P0 . Also, we assume that
the noise et is independent from X0 .

To simplify our life, we assume that the vectors X0 , e1 , . . . , eN are Gaussian


(and therefore everything has a Gaussian distribution).
It looks like (2.1) is limited to the first order autoregression. However, it is not
true. For instance, imagine that a time series Xt is a second order autoregression
with a constant added,
Xt = b1 Xt−1 + b2 Xt−2 + b3 + εt
In order to squeeze this equation into the form (2.1), we set
Xt = (Xt, Xt−1, 1)^T,   At = A = [ b1 b2 b3 ; 1 0 0 ; 0 0 1 ],   et = (εt, 0, 0)^T
(matrix rows are separated by semicolons).
Essentially, we have included Xt−1 into the present state of the system and we
don’t even use the vector at .
Let’s now continue with the setup. Unfortunately, we can’t observe Xt . What
we can observe is another (column) vector Yt of dimension m. We assume that Yt
is related to Xt by the following, so called measurement, equation
(2.2) Yt = Ht Xt + ft
where Ht is an m × k matrix and ft is another noise, Eft = 0, Cov ft = Rt , ft
are independent from es and X0 , ft are independent for different t, and ft are also
Gaussian.
Our ultimate goal is to estimate Xt given the observations Y1 , . . . , Yt , or, in
short, given Y≤t . However, we begin with examples which provide possible inter-
pretations for this model.
1. Steady model. Xt = Xt is a one-dimensional process satisfying
(2.3) Xt = Xt−1 + εt
(which stands for the transition equation). As we know, Xt = X0 + ε1 + ε2 + · · · + εt
is a sequence of sums of i.i.d. random variables. This can be considered as a
model of random fluctuations of the level. Now, suppose we observe Xt with some
(measurement) error:
Yt = Xt + ηt
(the measurement equation) where ηt is another noise, and our goal is to estimate
the current level Xt .
2. Linear growth model. This time, we assume that Xt grows in time but
the rate of growth fluctuates randomly and, once again, we observe it with random
error. So, we assume that
Xt = Xt−1 + at−1 + ε′t
at = at−1 + ε″t
Yt = Xt + ηt
In order to represent this situation as a state-space model, we set
Xt = (Xt, at)^T,   At = [ 1 1 ; 0 1 ],   et = (ε′t, ε″t)^T
for the transition equation, and
Yt = Yt,   Ht = (1  0),   ft = ηt
for the measurement equation.

3. Missing data model. Suppose that the process Xt , t = 1, . . . , N is, for


instance, the autoregression of the first order
Xt = aXt−1 + εt
where a and σε2 are known to us (or have been already estimated). However, one
of the values, say, Xτ , is missing. Our goal is to estimate Xτ given all Xt , t ≠ τ .
This is the case when we have to take advantage of the model’s flexibility (cor-
responding matrices At and Ht will depend on t). In order to design the transition
equation, we set
Xt = (Xt, 0)^T for t ≤ τ,    Xt = (Xt, Xτ)^T for t > τ,
and
At = [ a 0 ; 0 1 ] for t ≠ τ + 1,    Aτ+1 = [ a 0 ; 1 0 ].
Finally,
et = (εt, 0)^T.
As a result, the transition equation looks as follows:
(Xt, 0)^T = [ a 0 ; 0 1 ] (Xt−1, 0)^T + (εt, 0)^T,   t ≤ τ,
(Xτ+1, Xτ)^T = [ a 0 ; 1 0 ] (Xτ, 0)^T + (ετ+1, 0)^T,   t = τ + 1,
(Xt, Xτ)^T = [ a 0 ; 0 1 ] (Xt−1, Xτ)^T + (εt, 0)^T,   t > τ + 1.
For the measurement equation, we set
Ht = (1  0) for t ≠ τ,    Hτ = (0  0),
and ft = 0 for all t. So, the measurement equation implies
Yt = Xt for t ≠ τ,    Yτ = 0.
Our goal is to construct an estimate for XN given Y≤N = {Xt : t ≠ τ}. However, the
first component of XN is equal to YN = XN (so its estimation is trivial) and the
second component of XN is Xτ .
Kalman filter. As we know, the best predictor for Xt is the conditional
expectation E(Xt |Y≤t ). In the Gaussian case, it coincides with the best linear
predictor. Denote by
Xt|s = E(Xt |Y≤s )
We are interested in Xt|t for all t. Clearly, X0|0 = EX0 = x0 (no observations are
available at time zero, so the conditional expectation is equal to the non-conditional
one). Beyond that, we will present a two step iteration scheme which allows us to
move on, first, from Xt−1|t−1 to Xt|t−1 and after that, from Xt|t−1 to Xt|t .
The vector Xt − Xt|s represents the prediction error. We denote by
Pt|s = E(Xt − Xt|s )(Xt − Xt|s )T
the covariance matrix of the prediction error. At time zero,
X0 − X0|0 = X0 − x0

and therefore
P0|0 = P0 .
First step. Suppose Xt−1|t−1 and Pt−1|t−1 are known. By the transition equa-
tion,
(2.4) Xt = At Xt−1 + et + at
Taking the conditional expectation with respect to Y≤t−1 , we therefore get
(2.5) Xt|t−1 = At Xt−1|t−1 + at
(et disappeared because Y≤t−1 depends on X0 , e0 , . . . , et−1 , f1 , . . . , ft−1 and is in-
dependent from et ). Subtracting (2.5) from (2.4), we get an expression for the
prediction error
Xt − Xt|t−1 = At (Xt−1 − Xt−1|t−1 ) + et
(at cancels out). Therefore
Pt|t−1 = E(Xt − Xt|t−1 )(Xt − Xt|t−1 )T
(2.6) = At E(Xt−1 − Xt−1|t−1 )(Xt−1 − Xt−1|t−1 )T ATt + Eet eTt
= At Pt−1|t−1 ATt + Qt
Second step. Suppose now that Xt|t−1 and Pt|t−1 are known, and Yt arrived.
We have
Yt = Ht Xt + ft
and therefore
Yt|t−1 = Ht Xt|t−1
where Yt|t−1 = E(Yt |Y≤t−1 ) is the prediction for Yt given Y≤t−1 . Once again, ft
disappears because it is independent from Y≤t−1 . Denote by
Zt = Yt − Yt|t−1 = Ht (Xt − Xt|t−1 ) + ft
the prediction error. Denote by Ft the covariance matrix of Zt . It is equal to
Ft = EZt ZTt = E(Yt − Yt|t−1 )(Yt − Yt|t−1 )T
(2.7)
= Ht Pt|t−1 HtT + Rt
It is easy to see that Zt is uncorrelated with Y≤t−1 (and therefore independent from
it). In a sense, Zt represents the new information which is contained in Yt . Using
this property, one can show that the best predictor for Xt given Y≤t is equal to
Xt|t = Xt|t−1 + E(Xt |Zt )
Since everything is Gaussian, the conditional expectation E(Xt |Zt ) can be repre-
sented in terms of the covariance matrix of Xt and Zt , and the covariance matrix
of Zt which is Ft . Namely,
E(Xt |Zt ) = E(Xt ZTt )Ft−1 Zt
where
E(Xt ZTt ) = EXt (Xt − Xt|t−1 )T HtT + EXt ft
= E(Xt − Xt|t−1 )(Xt − Xt|t−1 )T HtT + EXt|t−1 (Xt − Xt|t−1 )T HtT
= Pt|t−1 HtT

All the other terms disappear because ft is independent from Xt and (Xt − Xt|t−1 )
is independent from Y≤t−1 and therefore from Xt|t−1 as well. Finally, we get
(2.8) Xt|t = Xt|t−1 + Pt|t−1 HtT Ft−1 (Yt − Ht Xt|t−1 )
which implies
Xt − Xt|t = Xt − Xt|t−1 − Pt|t−1 HtT Ft−1 (Yt − Ht Xt|t−1 )
and
(2.9) Pt|t = Pt|t−1 − Pt|t−1 HtT Ft−1 Ht Pt|t−1
Combining both steps together, we get
(2.10) Xt+1|t = At+1 Xt|t−1 + At+1 Pt|t−1 HtT Ft−1 (Yt − Ht Xt|t−1 )
and
(2.11) Pt+1|t = At+1 (Pt|t−1 − Pt|t−1 HtT Ft−1 Ht Pt|t−1 )ATt+1 + Qt
(note that the matrix Ft given by (2.7), depends on Pt|t−1 ). The matrix
Kt = At+1 Pt|t−1 HtT Ft−1
from (2.10) is called the gain matrix for the Kalman filter.
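The two-step recursion is short enough to write out in full. The sketch below (assuming numpy) implements (2.5)-(2.9) for constant matrices A, H, Q, R; note that it uses the one-step update gain P_{t|t-1} H^T F_t^{-1}, while the gain matrix Kt in (2.10) carries the extra factor A_{t+1} because it combines both steps.

    # A minimal Kalman filter sketch for constant A, H, Q, R (time-varying versions
    # only require indexing these matrices by t).
    import numpy as np

    def kalman_filter(Y, A, H, Q, R, x0, P0, a=None):
        k = len(x0)
        a = np.zeros(k) if a is None else np.asarray(a, float)
        x, P = np.array(x0, float), np.array(P0, float)
        filtered = []
        for y in Y:
            # prediction step, equations (2.5)-(2.6)
            x = A @ x + a
            P = A @ P @ A.T + Q
            # update step, equations (2.7)-(2.9)
            F = H @ P @ H.T + R
            K = P @ H.T @ np.linalg.inv(F)
            x = x + K @ (np.atleast_1d(y) - H @ x)
            P = P - K @ H @ P
            filtered.append(x.copy())
        return np.array(filtered)

    # Steady model (Example 1): X_t = X_{t-1} + eps_t, Y_t = X_t + eta_t
    # A = np.array([[1.0]]); H = np.array([[1.0]])
    # Q = np.array([[0.5]]); R = np.array([[1.0]])
    # xs = kalman_filter(observed_y, A, H, Q, R, x0=[0.0], P0=[[10.0]])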
APPENDIX A

Elements of Probability

1. Basic Concepts
A1.1. Sample space, Events, Probabilities. We begin with a concept of
a sample space, typically denoted by Ω. It is defined as a collection of all possible
outcomes. Elements of the sample space ω ∈ Ω are called outcomes or elementary
events. Subsets A ⊂ Ω of the sample space are called events (intuitively, an event
is a collection of outcomes with certain property).
Events A1 , A2 , . . . are called mutually exclusive, or disjoint, if Ai ∩ Aj = ∅ for
i 6= j.
To every event A, there corresponds its probability P (A). The probabilities
P (A) must satisfy the following properties (Axioms of Probability).
1. 0 ≤ P (A) ≤ 1.
2. P (Ω) = 1.
3. (countable additivity or σ-additivity). If events A1 , A2 , . . . are disjoint, then

P(∪_{n=1}^{∞} An) = Σ_{n=1}^{∞} P(An)

Intuitively, P (A) is the frequency of the occurrence of the event A if we repeat the
same random experiment again and again. So, if P (A) = 0, then the event A is
impossible. If P (A) = 1, then A will definitely occur.
Further properties of the probabilities. Axioms of the probability imply:
1. P (∅) = 0.
2. (Finite additivity) If A1 , . . . , An are mutually exclusive, then P(∪_{k=1}^{n} Ak) = Σ_{k=1}^{n} P(Ak).
3. If Ac = Ω \ A is the complement of A, then P (Ac ) = 1 − P (A)
4. (Monotonicity) If A ⊂ B, then P (A) ≤ P (B).
5. (Continuity) (a) Let An be a sequence of events such that An ⊂ An+1 for all n. Then P(∪_{n=1}^{∞} An) = lim_{n→∞} P(An).
(b) Let An be a sequence of events such that An ⊃ An+1 for all n. Then P(∩_{n=1}^{∞} An) = lim_{n→∞} P(An).
See Problems 1-7 at the end of the section.
Conditional Probabilities. Suppose we know that the event B has occurred.
Can we say anything about another event A? To address this problem, we define a
conditional probability of A given B by the formula
P(A|B) = P(AB)/P(B)
(provided P (B) 6= 0). In a sense, B acts as a new sample space, only outcomes
ω ∈ B are possible. There are two major formulas related to the concept.

Law of total probability. If the events B1 , . . . , Bn are disjoint and Ω = ∪_{i=1}^{n} Bi, then
P(A) = Σ_{i=1}^{n} P(A|Bi) P(Bi)
Bayes formula. Again, if the events B1 , . . . , Bn are disjoint and Ω = ∪_{i=1}^{n} Bi, then
P(Bj|A) = P(A|Bj) P(Bj) / Σ_{i=1}^{n} P(A|Bi) P(Bi)
A1.2. Independent events. Two events A and B are called independent if
P (AB) = P (A)P (B)
Independency is equivalent to the property P (A|B) = P (A) (information about the
event B does not affect the probability of A). In case of three or more events, the
definition becomes more complicated. Namely, events A1 , . . . , An , (. . . ) are called
independent if
P (Ai1 Ai2 . . . Aik ) = P (Ai1 )P (Ai2 ) . . . P (Aik )
for every sub-collection Ai1 , . . . , Aik (that is, for all pairs, all triplets, etc.).
Important property. Suppose events A1 , A2 , . . . are independent. The indepen-
dency will not be broken if any of the events is replaced by its complement (see
Problem 9).
A1.3. Random Variables. A random variable X(ω) is a real-valued function
on a sample space Ω.
Distribution function, or cumulative distribution function, or c.d.f of a random
variable X is defined by the formula F (x) = FX (x) = P {X ≤ x}. Distribution
function is an increasing right continuous function, F (−∞) = 0, F (∞) = 1. It
describes the distribution entirely (probability of every event related to X can be
somehow expressed in terms of FX ).
Discrete random variables. Random variable X has a discrete distribution
if the set of its possible values is finite or countable. Distribution function of a
discrete random variable is a step function, it jumps up at points that are possible
values of X. Distribution of a discrete random variable can be characterized also
by its probability mass function:
p(x) = pX (x) = P {X = x}
A probability mass function has the following properties:
1. p(x) ≠ 0 only for a finite or countable number of values.
2. p(x) ≥ 0.
3. Σ_x p(x) = 1.
Continuous random variables. In time series analysis, we mostly work
with continuous random variables.
We say that a random variable X has a continuous distribution with the density
function f (x) = fX (x), called the (probability) density of X if
P{a ≤ X ≤ b} = ∫_a^b f(x) dx
for all a < b. The density function must be non-negative, and
∫_{−∞}^{∞} f(x) dx = 1

For a continuous random variable,


FX(x) = ∫_{−∞}^{x} fX(t) dt,   fX(x) = F′X(x)

A1.4. Expectation. For a continuous random variable X with the density


f (x), its expectation is defined by the formula
EX = ∫_{−∞}^{∞} t f(t) dt
provided the integral converges absolutely. If X is discrete, then the expectation is defined as
EX = Σ_x x p(x)

(again, we assume that the series converges absolutely). If the integral (or series)
does not converge, then the random variable does not have expectation (or, which
is the same, EX does not exist).
Properties of the expectation. Expectation of a random variable has the
following properties.
1. If c is a constant, then Ec = c
2. Linearity: E(aX + bY ) = aEX + bEY if both EX and EY exist. In
particular, E(aX + b) = aEX + b.
3. Monotonicity: If X ≥ Y , then EX ≥ EY .
4. If X ≥ 0 and EX = 0, then X = 0 with probability 1.
5. Useful formula. Suppose Y = g(X) where g is a real-valued function and X
has a continuous distribution with the density fX . Then
EY = Eg(X) = ∫_{−∞}^{∞} g(t) fX(t) dt

A1.5. Variance and standard deviation. Variance of a random variable


X is defined as
Var X = E(X − EX)2
provided all the expectations exist. If some of them don’t exist, then we say that
the variance does not exist (or is infinite). Variance could also be found by the
formula
Var X = E(X 2 ) − (EX)2
A square root of Var X is called the standard deviation of X and is denoted by SD X.
Intuitively, variance and standard deviation show how spread is the distribution.
Properties of the variance. The following properties could be easily derived
from the properties of the expectation.
1. Variance is non-negative: Var X ≥ 0.
2. If Var X = 0, then X is a constant with probability 1.
3. Scaling formula: Var(aX + b) = a2 Var X; SD(aX + b) = |a| SD X.
Markov and Chebyshev inequalities. Suppose X is a non-negative ran-
dom variable with finite expectation. The following inequality is known as Markov
inequality: for every positive a > 0,
EX
(1.1) P (X ≥ a) ≤
a

In order to prove it, consider a random variable


U = a if X ≥ a,   and U = 0 otherwise.

By construction, U ≤ X and therefore EU = aP (X ≥ a) ≤ EX.


Assume now that random variable Y (not necessarily non-negative) has finite
expectation m = EY and variance σ 2 = Var(Y ). For every positive ε > 0, we have
(1.2)   P(|Y − m| ≥ ε) ≤ σ^2/ε^2
(so called Chebyshev inequality). It follows at once from the Markov inequality
applied to X = (Y − m)2 and a = ε2 .
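Both inequalities are easy to illustrate numerically; the following sketch (assuming numpy, with an exponential distribution chosen arbitrarily) compares the empirical probabilities with the bounds (1.1) and (1.2).

    # Numerical illustration of the Markov and Chebyshev inequalities (a sketch).
    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.exponential(scale=1.0, size=100_000)    # EX = 1, Var X = 1
    a, eps = 3.0, 2.0

    print(np.mean(X >= a), "<=", X.mean() / a)                          # Markov (1.1)
    print(np.mean(np.abs(X - X.mean()) >= eps), "<=", X.var() / eps**2) # Chebyshev (1.2)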
A1.6. Covariance. For two random variables X and Y , we define their
covariance by the formula
Cov(X, Y ) = E(X − EX)(Y − EY )
provided all the expectations exist. It could also be found by the formula
Cov(X, Y ) = E(XY ) − EXEY
In fact, computation of the covariance requires the joint distribution of X and Y .
Properties of the covariance. The following properties could be obtained
from the properties of expectation and variance.
1. It is symmetric: Cov(X, Y ) = Cov(Y, X), Cov(X, X) = Var X.
2. It is linear in both arguments: Cov(a1 X1 + a2 X2 , Y ) = a1 Cov(X1 , Y ) +
a2 Cov(X2 , Y ); same is true for the second argument.
3. It satisfies the Schwartz inequality (Problem 13):
(1.3) Cov(X, Y )2 ≤ Var X Var Y.
In particular, we have
Var(X1 + · · · + Xn) = Cov(X1 + · · · + Xn, X1 + · · · + Xn)
= Σ_{i=1}^{n} Σ_{j=1}^{n} Cov(Xi, Xj)
= Σ_{i=1}^{n} Var(Xi) + 2 Σ_{i<j} Cov(Xi, Xj)

In contrast to variance, covariance can be of any sign. For instance, Cov(X, −X) =
− Var(X).
Correlation coefficient. It is defined by the formula
Corr(X, Y) = ρ(X, Y) = Cov(X, Y) / √( Var(X) Var(Y) )
Schwartz inequality implies −1 ≤ Corr(X, Y ) ≤ 1. It could be shown that, if
| Corr(X, Y )| = 1, then Y = aX + b.
Variance—covariance matrix. Let X1 , . . . , Xn be random variables, and let
Cij = Cov(Xi , Xj ). A square matrix C = (Cij ) is called the variance-covariance

matrix, or just the covariance matrix, for X1 , . . . , Xn . The covariance matrix is symmetric and non-negative definite: for all real numbers a1 , . . . , an ,
Σ_i Σ_j ai aj Cij ≥ 0
(to verify that, note that the expression in question is actually equal to Var(a1X1 + · · · + anXn) and therefore must be non-negative.)
For the rest of the exposition, we focus on continuous random variables.
A1.7. Joint distributions. We say that random variables X1 , . . . , Xn have a
continuous joint distribution with the density f (x1 , . . . , xn ) = fX1 ,...,Xn (x1 , . . . , xn )
if
P{a1 ≤ X1 ≤ b1, . . . , an ≤ Xn ≤ bn} = ∫_{a1}^{b1} · · · ∫_{an}^{bn} f(x1, . . . , xn) dxn . . . dx1
for every collection of real numbers a1 < b1 , . . . , an < bn . The function fX1,...,Xn is called the joint density of random variables X1 , . . . , Xn .
The joint density f is non-negative and satisfies the condition
∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} f(x1, . . . , xn) dxn . . . dx1 = 1

If the joint density of X1 , . . . , Xn is known, then the density of any sub-collection


could be found by integration. For instance, if f (x, y) is the joint density of X
and Y , then the densities of X and Y (marginal densities) could be found by the
formulae:
fX(x) = ∫_{−∞}^{∞} fXY(x, y) dy,   fY(y) = ∫_{−∞}^{∞} fXY(x, y) dx

Useful formula. If Y = g(X1 , . . . , Xn ) and the joint density of X1 , . . . , Xn is


known to us, then
(1.4)   EY = Eg(X1, . . . , Xn) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} g(x1, . . . , xn) f(x1, . . . , xn) dxn . . . dx1

A1.8. Independency of random variables. Random variables X1 , . . . , Xn


are called independent if, for every collection of real numbers ai < bi , i = 1, . . . , n,
events Ai = {ai ≤ Xi ≤ bi }, i = 1, . . . , n are independent. It could be shown that,
in case if Xi have a continuous joint distribution, then X1 , . . . , Xn are independent
if and only if the joint density is equal to the product of the marginal ones:
(1.5) fX1 ...Xn (x1 , . . . , xn ) = fX1 (x1 ) . . . fXn (xn )
An infinite sequence of random variables Xi , i = 1, 2, . . . is called independent if
any finite sub-collection is independent.
If X and Y are independent, then E(XY ) = EX EY (follows from the formula
(1.4) and from the factorization formula (1.5) for the joint density). In particular,
Cov(X, Y ) = 0, Corr(X, Y ) = 0 and therefore Var(X + Y ) = Var(X) + Var(Y ).
A1.9. Conditional density. Let X, Y have a joint density fXY (x, y). The
conditional density of Y given X is defined by the formula
f(y|x) = fY|X(y|x) = fXY(x, y)/fX(x) = fXY(x, y) / ∫_{−∞}^{∞} fXY(x, y) dy.

Essentially, in order to construct f (y|x), we fix x. Then we consider g(y) =


fXY (x, y) as a function of y and normalize it so that it integrates to one. If X
and Y are independent, then the conditional density f (y|x) of Y given X = x does
not depend on x and coincides with the marginal density of Y .
In a similar way, we define conditional densities fY,Z|X , fZ|X,Y and so on.
A1.10. Conditional Expectation and Conditional Variance. The
conditional expectation of Y given X can be defined as an expectation with respect
to the conditional density:
(1.6)   E(Y|X = x) = ∫_{−∞}^{∞} y fY|X(y|x) dy

It depends on the value x. Substituting the random variable X instead of x here,


we get E(Y |X) which is now a random variable.
In a similar way, we can define E(Y |X1 , . . . , Xn ) in terms of conditional den-
sity fY |X1 ,...,Xn . In fact, conditional expectation could be defined in a very general
setting, not only if the joint distribution is continuous (see advanced courses in
probability). Conditional expectation has all the standard properties of the expec-
tation (see A1.4) as well as the following three important properties:
1. If X and Y are independent, then E(Y |X) = EY .
2. E(E(Y |X)) = EY .
3. E[f (X)Y |X] = f (X)E(Y |X).
If X and Y do have a joint density, then these properties can be easily verified.
For instance, if X and Y are independent, then the conditional density f (y|x)
does not depend on x and coincides with the marginal density fY (y). Therefore
E(Y |X = x) = EY does not depend on x, which implies the first of the properties.
For the second one, denote by g(x) the conditional expectation E(Y |X = x) defined
by the formula (1.6). Then
E(E(Y|X)) = Eg(X) = ∫_{−∞}^{∞} g(x) fX(x) dx
= ∫_{−∞}^{∞} ( ∫_{−∞}^{∞} y fXY(x, y)/fX(x) dy ) fX(x) dx
= ∫_{−∞}^{∞} ∫_{−∞}^{∞} y fXY(x, y) dy dx = EY

Along the same lines, we can verify that E[f (X)Y |X = x] = f (x)E(Y |X = x),
which implies the last property. If, however, X and Y do not have a joint density,
those properties still hold though their verification is more involved.
In a similar way, we can define a conditional variance Var(Y |X) as a vari-
ance with respect to the conditional distribution. For the conditional variance, the
following identity holds:
(1.7) Var(Y ) = E(Var(Y |X)) + Var(E(Y |X))
Example. Importance of conditional expectations could be seen from the
following problem. Suppose we want to predict Y given the value of X. In other
words, we’d like to choose a function h(x) such that h(X) is “close” to Y . To be
specific, we want the expectation
E(Y − h(X))2

to be as small as possible. We claim that the best h(X) coincides with g(X) =
E(Y |X). Indeed,
E(Y − h(X))2 = E((Y − g(X)) + (g(X) − h(X)))2
= E(Y − g(X))2 + E(g(X) − h(X))2 + 2E[(g(X) − h(X))(Y − g(X))]
Now, we claim that the last term on the right is equal to zero. To verify that, we
use the properties of conditional expectation. We have
E[(g(X) − h(X))(Y − g(X))] = E[E[(g(X) − h(X))(Y − g(X))|X]]
= E[(g(X) − h(X))E[Y − g(X)|X]]

(we used the second property and after that, the third one). But,
E(Y − g(X)|X) = E(Y |X) − g(X) = g(X) − g(X) = 0.
So,
E(Y − h(X))2 = E(Y − g(X))2 + E(g(X) − h(X))2
where the first term does not depend on h and the second is non-negative. So, the
whole thing is minimal if h = g.
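A quick simulation makes the claim concrete. In the sketch below (assuming numpy), Y = sin(X) + noise, so g(X) = E(Y|X) = sin(X); any other predictor, here the best linear one, has a larger mean squared error.

    # Numerical illustration: E(Y|X) minimizes the mean squared prediction error.
    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.uniform(-3, 3, 200_000)
    Y = np.sin(X) + rng.normal(0, 0.5, X.size)

    mse_g = np.mean((Y - np.sin(X)) ** 2)                    # predictor g(X) = E(Y|X)
    slope, intercept = np.polyfit(X, Y, 1)                   # best linear predictor
    mse_lin = np.mean((Y - (slope * X + intercept)) ** 2)
    print(mse_g, "<", mse_lin)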
A1.11. Distribution of a sum of independent random variables. If
X, Y are independent random variables and if fX , fY are corresponding densities,
then the density of Z = X + Y could be found as a convolution of the densities fX
and fY :
(1.8)   fX+Y(z) = ∫_{−∞}^{∞} fX(x) fY(z − x) dx

A1.12. Distribution of functions of random variables. Let X be a


continuous random variable with density fX (x) and let Y = g(X) where g(x) is
a strictly monotone increasing or decreasing function. Denote by h the inverse
function to g, so that X = h(Y ). Density fY of the random variable Y could be
expressed in terms of fX and h. For instance, assume that g is monotone increasing.
We have
FY(y) = P(g(X) ≤ y) = P(X ≤ h(y)) = FX(h(y))
If g is decreasing,
FY(y) = P(g(X) ≤ y) = P(X ≥ h(y)) = 1 − FX(h(y))
Differentiating with respect to y, we get
(1.9)   fY(y) = fX(h(y)) |h′(y)| = fX(h(y)) / |g′(h(y))|.
The absolute value sign allows us to combine both cases in one formula.
The formula (1.9) can be extended to a multivariate case. Namely, suppose
that X1 , . . . , Xn are random variables with known joint density fX1 ...Xn , and
Y1 = g1 (X1 , . . . , Xn )
(1.10) ...
Yn = gn (X1 , . . . , Xn )

Assume that the functions g1 , . . . , gn are differentiable and define a 1-1 transforma-
tion, that is (1.10) implies
X1 = h1 (Y1 , . . . , Yn )
(1.11) ...
Xn = hn (Y1 , . . . , Yn )
(this assumption is harmless to us because we will be primarily interested in case
when g1 , . . . , gn are linear). Let
H = (Hij) = ( ∂hi/∂yj )
be the matrix of the partial derivatives of the inverse transformation. Then
(1.12)
fY1 ...Yn (y1 , . . . , yn ) = |J(y1 , . . . , yn )|fX1 ...Xn (h1 (y1 , . . . , yn ), . . . , hn (y1 , . . . , yn ))
where J = det H, the determinant of H = (Hij ), is called the Jacobian of the
transformation.
A1.13. Moment generating function. For a random variable X, its mo-
ment generating function, or m.g.f. φ(t) is defined by the formula
(1.13) φ(t) = φX (t) = EetX .
We have φ(0) = 1. The expectation EetX may fail to exist for some t 6= 0. However,
if it is defined in a neighborhood of 0, then EX k exists for all k = 1, 2, . . . and
(1.14) φ(k) (0) = EX k .
The expectation EX k is called the k-th moment of random variable X. According
to (1.14), we can find all the moments by differentiating the moment generating
function at the origin, which explains the name.
Moment generating functions have many useful properties. Two of them are
of special importance. First of all, if moment generating functions of X and Y
coincide, then the distributions of X and Y coincide (in fact, it is enough to have
φX (t) = φY (t) in a neighborhood of zero). Second, if X and Y are independent,
then the m.g.f. of the sum X + Y is equal to the product of m.g.f.’s:
φX+Y (t) = Eet(X+Y ) = EetX etY = EetX EetY = φX (t)φY (t).
This property sometimes allows us to identify the distribution of the sum of two
independent random variables.
Finally, here is a useful scaling identity:
φaX+b (t) = etb φX (at).
A1.14. Characteristic function and Laplace transform. Moment gen-
erating functions have but one drawback, they may be undefined for some t, even
for all t 6= 0 (and then, they are quite useless). However, moment generating func-
tion has a big brother — characteristic function. For a random variable X, its
characteristic function ψ(t) is defined by the formula
ψ(t) = ψX (t) = EeitX
where i stands for the imaginary unit. It is therefore a complex-valued function,
more specifically, it is equal to
ψX (t) = E cos(tX) + iE sin(tX).

Since sin and cos are bounded functions, the characteristic function is well defined
for all t. It could be shown that the characteristic function is continuous. Also,
ψ(0) = 1. Properties of characteristic functions are similar to those of moment
generating functions. In particular, if characteristic functions of X and Y coincide
in a neighborhood of zero, then the distributions of X and Y coincide. Second, if
X and Y are independent, then the characteristic function of the sum X + Y is
equal to the product of characteristic functions:
ψX+Y (t) = Eeit(X+Y ) = EeitX eitY = EeitX EeitY = ψX (t)ψY (t).
Also,
ψaX+b (t) = eitb ψX (at)
For non-negative random variables, one usually uses the Laplace transform. The
Laplace transform ϕX (t) for a non-negative random variable X is defined by the
formula
ϕ(t) = ϕX (t) = Ee−tX , t ≥ 0.
So, the Laplace transform is, essentially, the moment generating function restricted
to negative arguments: ϕX (t) = φX (−t). Still, the Laplace transform is well de-
fined for all non-negative t, it determines the distribution uniquely and ϕX+Y (t) =
ϕX (t)ϕY (t) for independent X, Y . In contrast to the characteristic function, Laplace
transform is real-valued. On the other hand, the characteristic function is the
Fourier transform of the density of X and therefore the density could be found
through the inverse Fourier transform. This trick does not work with Laplace
transforms (in order to find the density of X via its Laplace transform, we have to
consider it for complex arguments t).

2. Important Probability distributions


A2.1. Normal, or Gaussian, distribution. It is normally denoted by
N (m, σ 2 ). It has two parameters m and σ 2 , its density is given by the formula
(2.1)   ϕ(x) = (1/(√(2π) σ)) exp{ −(x − m)^2/(2σ^2) }
The parameters actually serve as the expectation and variance of the distribution:
EX = m, Var X = σ 2 , SD(X) = σ (See Problem 28). Distribution N (0, 1) is called
standard normal (corresponding random variable is often denoted by Z). In case
of a standard normal, (2.1) simplifies to
(2.2)   ϕ(x) = (1/√(2π)) exp{ −x^2/2 }
In particular,
(2.3)   ∫_{−∞}^{∞} (1/√(2π)) e^{−t^2/2} dt = 1
(see Problem 25).
If X is normal with parameters m, σ 2 , then Z = (X −m)/σ is standard normal.
Distribution function of a standard normal random variable is denoted by Φ(x) and
is given by the integral:
Φ(x) = P{Z ≤ x} = ∫_{−∞}^{x} (1/√(2π)) e^{−t^2/2} dt

The integral can be evaluated only numerically. Tables of the standard normal
distribution could be found practically in any text in probability and statistics.
If X is normal with parameters m and σ 2 , then
P {|X − m| < aσ} = 2Φ(a) − 1.
In particular,
P {|X − m| < 1.96σ} = 0.95,
P {|X − m| < 2.58σ} = 0.99,
P {|X − m| < 3σ} = 0.9973,
P {|X − m| < 3.29σ} = 0.999
A linear combination of independent normal random variables is again normal:
if X1 and X2 are independent and normally distributed with parameters m1 , σ12
and m2 , σ22 , then aX1 + bX2 also has a normal distribution with parameters am1 +
bm2 , a2 σ12 + b2 σ22 (see Problem 33b).
Moment generating function and characteristic function of a normal distribu-
tion are given by the formulas
φX(t) = e^{tm} e^{t^2 σ^2/2},   ψX(t) = e^{itm} e^{−t^2 σ^2/2}

(see Problem 33a).


A2.2. Gamma distribution. A random variable has Gamma distribution
with parameters α and θ if its density is given by the formula
f(x) = (1/(Γ(α) θ^α)) x^{α−1} e^{−x/θ} for x > 0,   and f(x) = 0 for x ≤ 0,
where
(2.4)   Γ(α) = ∫_0^∞ x^{α−1} e^{−x} dx
is the famous Gamma function. Here parameters α and θ must be strictly positive.
Easy computation (Problem 31) shows that EX = αθ, Var X = αθ2 .
Moment generating function and Laplace transform of the Gamma distribution
are given by the formulas
φX (t) = (1 − θt)−α , t < 1/θ
and
ϕX (t) = (θt + 1)−α , t > 0
Examples. 1. If Z is standard normal, then X = Z 2 has Gamma distribution
with parameters α = 1/2, θ = 2 (see Problem 29).
2. If α = 1, θ = 1/λ, then it is an exponential distribution: f (x) = λe−λx .
If X, Y are independent, X has Gamma distribution with parameters α, θ and
Y has Gamma distribution with parameters β, θ (note that the second parameter
coincides for X and Y ), then Z = X + Y also has Gamma distribution with
parameters α + β, θ (see Problem 32).
A2.3. χ2 -distribution. This is a special case of Gamma distribution. We
say that X has χ2 distribution with n degrees of freedom (and we write X is χ2 (n))
if X is Gamma with parameters α = n/2 and θ = 2. In particular, if Z is standard
normal, then Z 2 is χ2 (1) (see Example 1 above) and if Z1 , . . . , Zn are independent
standard normals, then X = Z12 + · · · + Zn2 is χ2 (n).

According to the formula above, if X is χ2 (n), then EX = n, Var(X) = 2n.


Moment generating function and Laplace transform of the χ2 distribution are given
by the formulas
φX (t) = (1 − 2t)−n/2 , t < 1/2
and
ϕX (t) = (2t + 1)−n/2 , t > 0.
If X and Y are independent and X is χ^2(n), Y is χ^2(m), then X + Y is χ^2(n + m). The last property can be somewhat reversed: if X, Y are independent, Z = X + Y and if X is χ^2(n), Z is χ^2(n + m), then Y is χ^2(m) (Problem 37).
In time series analysis, χ^2(1) and χ^2(2) play a special role. Since Γ(1/2) = √π (Problem 35), the density for χ^2(1) equals
f(x) = (1/√(2πx)) e^{−x/2} for x > 0,   and f(x) = 0 for x ≤ 0.
For χ2 (2), the formula is surprisingly simple:
f(x) = (1/2) e^{−x/2} for x > 0,   and f(x) = 0 for x ≤ 0.
So, χ2 (2) is just an exponential distribution with parameter 1/2.
A2.4. More distributions. The following two distributions show up in
statistical tests.
Student distribution or t distribution. t distribution with n degrees of
freedom is defined as a distribution of a random variable
X = Z / √(Y/n)
where Z is a standard normal, Y is χ^2(n) and Z and Y are independent. The density of the t distribution is given by the formula
f_t(x) = C(n) (1 + x^2/n)^{−(n+1)/2}
where
C(n) = (1/√(πn)) Γ((n + 1)/2) / Γ(n/2).
As n → ∞, the distribution converges to the standard normal. We are typically
interested in percentiles for the t distribution. Tables of the t distribution could be
found in standard texts in statistics.
Just in case, for t distribution, its moment generating function is defined only
for t = 0 and therefore is absolutely useless.
Fisher or F distribution. Let X, Y be independent random variables such that X is χ^2(n1) and Y is χ^2(n2). Then
F = (X/n1) / (Y/n2)
has F distribution with (n1, n2) degrees of freedom. Its density is given by the formula
f_F(x) = C_{n1,n2} x^{n1/2 − 1} / (n1 x + n2)^{(n1+n2)/2},   x > 0
where
C_{n1,n2} = n1^{n1/2} n2^{n2/2} Γ((n1 + n2)/2) / ( Γ(n1/2) Γ(n2/2) )
Tables of the F distribution could be found in standard texts in statistics.
Uniform distribution. Random variable X has a uniform distribution on
the interval (a, b) if its density is given by the formula
f(x) = 1/(b − a) if a < x < b,   and f(x) = 0 otherwise.
Its expectation and variance are equal to
EX = (a + b)/2,   Var(X) = (b − a)^2/12
In time series, uniform distribution may show up when we deal with periodical
functions. For instance, we may consider a series
Xt = sin(θt + U )
where t = 0, 1, 2, . . . is interpreted as time and U is random. If we want all possible
phases to be equally likely, a natural assumption would be that U is uniform on
the interval (0, 2π).
Cauchy distribution. Random variable X has Cauchy distribution if its
density is given by the formula
f(x) = 1 / ( π(1 + x^2) )
This is an example of a distribution that has no expectation and variance. It
shows up on occasion as a result of transformation of other random variables.
For instance, if U is uniform on the interval (−π/2, π/2), then X = tan(U ) has
Cauchy distribution (Problem 19). If Z1 , Z2 are independent standard normals,
then X = Z1 /Z2 is also Cauchy (Problem 24) and so on.
A2.5. Multivariate normal distribution. It is the most important one
for the time series analysis. We say that X1 , . . . , Xn have a joint, or multivariate,
normal, a.k.a. Gaussian, distribution if there exist independent standard normal
random variables Z1 , . . . , Zm such that
X1 = a11 Z1 + · · · + a1m Zm + m1
X2 = a21 Z1 + · · · + a2m Zm + m2
(2.5)
...
Xn = an1 Z1 + · · · + anm Zm + mn
(such a representation is not unique). Multivariate normal distribution can be
uniquely characterized by its vector of the expectations
m = (m1 , . . . , mn ) = (EX1 , . . . , EXn )
and the covariance matrix C = (Cij ) where
Cij = Cov(Xi , Xj ).
If the representation (2.5) is given and A = (aij ), then the matrix C equals
C = AAT .
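The representation (2.5) is also a convenient way to simulate a multivariate normal vector; the sketch below (assuming numpy, with an arbitrary 2 × 2 matrix A) checks that the sample covariance matrix is close to AA^T.

    # Simulating a Gaussian vector as X = A Z + m and checking Cov(X) = A A^T.
    import numpy as np

    rng = np.random.default_rng(4)
    A = np.array([[1.0, 0.0],
                  [0.7, 0.5]])
    m = np.array([2.0, -1.0])

    Z = rng.standard_normal((2, 100_000))      # independent standard normals
    X = A @ Z + m[:, None]                     # representation (2.5)

    print(np.cov(X))                           # close to A @ A.T
    print(A @ A.T)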

If the matrix C is degenerate, then random variables X1 , . . . , Xn are linearly


dependent and joint density does not exist.
If C = (Cij ) is non-degenerate, then its inverse C −1 = B = (Bij ) exists, and
the joint density of X1 , . . . , Xn could be found from the formula
(2.6)   f(x1, . . . , xn) = ( 1 / ((2π)^{n/2} |det C|^{1/2}) ) e^{−Q(x1,...,xn)/2}
where
Q(x1, . . . , xn) = Σ_{i,j} (xi − mi)(xj − mj) Bij

Important properties of the multivariate normal distribution.


1. If X, Y have a multivariate normal distribution and if Cov(X, Y ) = 0, then
X, Y are independent. Indeed, then the covariance matrix C is a diagonal matrix,
hence its inverse B is also a diagonal matrix and therefore the joint density (2.6)
of X and Y equals to the product of the marginal densities.
2. If X1 , . . . , Xn have a multivariate normal distribution and if each of Y1 , . . . , Ym
is a linear combination of X1 , . . . , Xn :
Yi = Σ_{j=1}^{n} bij Xj,   i = 1, . . . , m,

then Y1 , . . . , Ym also have a multivariate normal distribution. Indeed, by definition,


Xi could be represented as linear combinations of some i.i.d. standard normals
Z1 , . . . Zk . But then, Yi are also equal to linear combinations of the same Z1 , . . . Zk .
The following property is even more important:
3. If Y, X1 , . . . , Xn have a multivariate normal distribution, then the condi-
tional expectation E(Y |X1 , . . . , Xn ) is a linear function of X1 , . . . , Xn :

E(Y |X1 , . . . , Xn ) = d0 + d1 X1 + · · · + dn Xn

Moreover, the conditional variance Var(Y |X1 , . . . , Xn ) is a constant. Verification


of this property requires long computations.
Finally, a useful formula. Let Y1 , Y2 , Y3 , Y4 have a multivariate normal distri-
bution with EYi = 0, i = 1, 2, 3, 4, and let Cij = Cov(Yi , Yj ). Then

(2.7) E(Y1 Y2 Y3 Y4 ) = C12 C34 + C13 C24 + C14 C23 .

In fact, this formula extends to a product of any even number of normal random
variables with zero expectations (so called Wick formula).
Example: bivariate normal distribution. It is a 2-dimensional distribution. It can be characterized by five parameters: the expectations mX and mY of X and Y, the variances σX^2 and σY^2 of X and Y and, finally, the correlation coefficient ρ = ρXY of X and Y. Then Cov(X, Y) = ρ σX σY,
C = [ σX^2   ρσXσY ; ρσXσY   σY^2 ]
and
B = C^{−1} = ( 1 / (σX^2 σY^2 (1 − ρ^2)) ) [ σY^2   −ρσXσY ; −ρσXσY   σX^2 ]

and therefore the joint density of X and Y is given by the formula


(2.8)   fXY(x, y) = ( 1 / (2π σX σY √(1 − ρ^2)) ) exp{ −( 1 / (2(1 − ρ^2)) ) [ (x − mX)^2/σX^2 + (y − mY)^2/σY^2 − 2ρ (x − mX)(y − mY)/(σX σY) ] }

3. Laws of large numbers and Central Limit Theorem


This small section contains results about limit behavior of the distribution of
a sum of i.i.d. random variables.
Suppose X1 , X2 , . . . are i.i.d. random variables. Let m = EXi and σ 2 = Var Xi .
Denote
Sn = X1 + · · · + Xn .
Weak Law of Large Numbers claims that for every ε > 0,
P{ |Sn/n − m| > ε } → 0   as n → ∞
(in words, Sn/n converges to m in probability).1
Weak law of large numbers follows easily from the computation
Var(Sn/n) = (1/n^2) Var Sn = (1/n^2) n σ^2 = σ^2/n → 0
as n → ∞ and from the Chebyshev inequality (1.2).
Strong Law of Large Numbers claims that
Sn/n → m
with probability one.
Under some mild assumptions, strong law of large numbers implies the weak
one, but not the other way around. The proof of the Strong Law of large numbers
requires delicate methods.

If we divide Sn by √n instead of n, then
Var(Sn/√n) = σ^2
does not depend on n. For this reason, we may expect a non-trivial limiting distribution for (Sn − nm)/√n (which now has expectation zero and fixed variance σ^2).
Central Limit Theorem claims that the distribution of (Sn − nm)/(σ√n) converges to a standard normal. Namely, it claims that
P{ a ≤ (Sn − nm)/(σ√n) ≤ b } → Φ(b) − Φ(a)
for every pair a < b.
Central Limit theorem allows us to use normal distribution as an approximation
to various distributions that can be represented as sums of i.i.d. random variables.
1 A sequence of random variables Xn converges to a random variable X in probability if, for every ε > 0, P(|Xn − X| > ε) → 0 as n → ∞.
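A small Monte Carlo experiment (a sketch assuming numpy; uniform summands are an arbitrary choice) shows the normal approximation at work.

    # Monte Carlo illustration of the Central Limit Theorem.
    import numpy as np
    from math import erf, sqrt

    rng = np.random.default_rng(5)
    n, reps = 400, 50_000
    m, sigma = 0.5, sqrt(1 / 12)                    # mean and SD of Uniform(0, 1)

    S = rng.uniform(0, 1, (reps, n)).sum(axis=1)
    T = (S - n * m) / (sigma * sqrt(n))             # standardized sums

    Phi = lambda x: 0.5 * (1 + erf(x / sqrt(2)))    # standard normal c.d.f.
    a, b = -1.0, 2.0
    print(np.mean((a <= T) & (T <= b)), Phi(b) - Phi(a))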

4. Convergence in mean squares


Let Xn be a sequence of random variables with finite expectations and vari-
ances, and let X be another variable. We say that the sequence Xn converges to
X in mean squares, and we write Xn → X (m.s.) if
lim_{n→∞} E(Xn − X)^2 = 0.

We say that the sequence Xn is a Cauchy sequence in mean squares if, for every
ε > 0, there exists N = N (ε) such that E(Xn − Xm )2 < ε whenever n, m ≥ N . It
could be shown that every sequence Xn that is Cauchy in mean squares, actually
converges in mean squares to some random variable X (it follows from the so called
completeness of L2 spaces, see advanced courses in Real Analysis).
A typical application of this result in time series analysis is as follows. Let Yn
be a sequence of random variables with zero expectation and finite variances, and
let
Sn = Y1 + · · · + Yn
be a partial sum of an infinite series Σ_{k=1}^{∞} Yk . The sequence Sn converges in mean
squares (and therefore the sum of the series could be defined) if it is Cauchy in
mean squares, that is if, for every ε > 0, there exists N such that
Var(Sn − Sm ) ≤ ε
whenever n, m ≥ N . In time series analysis, this property allows us to define a sum
of an infinite series made out of random variables if the partial sums converge in
mean squares.

Exercises
1. Show that P(∅) = 0. Hint: Use the axiom of countable additivity and set Ak = ∅ for all k ≥ 2.
2. Let A1 , . . . , An be mutually exclusive events. Using the result of Problem 1, show that P(∪_{k=1}^{n} Ak) = Σ_{k=1}^{n} P(Ak).
3. Show that P (Ac ) = 1 − P (A) where Ac = Ω \ A is the complement of A.
4. Show that, if A ⊂ B, then P (A) ≤ P (B).
5. Show that P (A ∪ B) = P (A) + P (B) − P (A ∩ B)
6*. Let An be a sequence of events such that An ⊂ An+1 for all n. Show that P(∪_{n=1}^{∞} An) = lim_{n→∞} P(An).
7*. Let An be a sequence of events such that An ⊃ An+1 for all n. Show that P(∩_{n=1}^{∞} An) = lim_{n→∞} P(An).
8*. If P (B) > 0, show that P (AB|B) ≥ P (AB|A ∪ B) and P (A|A ∪ B) ≥
P (A|B).
9. If events A1 , A2 , . . . , An are independent, then Ac1 , A2 , . . . , An are also inde-
pendent (so we can replace any of the events from the list by its complement and
still have the independency).
10. If A1 , . . . , An are independent events, then
P(A1 ∪ · · · ∪ An) = 1 − ∏_{k=1}^{n} (1 − P(Ak)).

11. If EX = 2 and Var(X) = 5, find E(X − 1)2 .



12. Let X1 , . . . , X5 be random variables with variance 4, and let Cov(Xi, Xj) = −1 if |i − j| = 1, and Cov(Xi, Xj) = 0 if |i − j| ≥ 2. Find
Var(X1 + X2 + X3 + X4 + X5);
Var(X1 − X2 + X3 − X4 + X5);
Var(X1 + 2X2 + 3X3 + 2X4 + X5);
Cov(X1 + X2 + X3 + X4 + X5, X1 − X2 + X3 − X4 + X5)
13. Prove the Schwartz inequality Cov(X, Y )2 ≤ Var(X) Var(Y ). [Hint: Ex-
press Var(X + tY ) in terms of Var(X), Var(Y ) and Cov(X, Y ) and consider it as a
function of t. Still, it’s a variance, so it must be non-negative for all t. How is that possible?]
14. (a) Show that Var(X) ≤ E(X−a)2 for every a; (b) Show that, if 0 ≤ X ≤ 1,
then Var(X) ≤ 1/4.
15*. Let X be a non-negative continuous random variable. Show that EX = ∫_0^∞ P(X > t) dt.
16*. Let X be a non-negative continuous random variable. Show that EX^n = ∫_0^∞ n t^{n−1} P(X > t) dt.
17*. Let X be a continuous random variable. Show that EX = ∫_0^∞ P(X > t) dt − ∫_0^∞ P(X < −t) dt. [Hint to Problems 15-17: Express P(X > t) and P(X < −t) as integrals and change the order of integration.]
18. Let X be a random variable with density f(x). (a) Find the density of Y = aX + b. (b) Suppose X is N(mX, σX^2). Verify that Y is also normal, N(amX + b, a^2 σX^2).
19. Let U be uniformly distributed on the interval (−π/2, π/2). Show that
X = tan(U ) has Cauchy distribution.
20. Let the joint density of X and Y be given by the formula
f(x, y) = x + y if 0 ≤ x ≤ 1, 0 ≤ y ≤ 1,   and f(x, y) = 0 otherwise.
(a) Find Cov(X, Y ); (b) Find the marginal density of X; (c) Find the conditional
density of Y given X = x and the conditional expectation E(Y |X = x).
21. Let the joint density of X and Y be given by the formula
f(x, y) = 1 if −1 < x < 1, 0 ≤ y ≤ 1 − |x|,   and f(x, y) = 0 otherwise.
(a) Find Cov(X, Y ); (b) Find the marginal density of X and that of Y ; (c) Find the
conditional density of Y given X = x and the conditional expectation E(Y |X = x).
22. Let X1 , . . . , Xn be i.i.d. standard normals and let
Y1 = X1
Y2 = 2X1 + X2 ,
Y3 = 2X1 + 2X2 + X3 ,
...
Yn = 2(X1 + X2 + · · · + Xn−1 ) + Xn

Use the formula (1.12) to find the joint density of Y1 , . . . , Yn .


23. Let U, W be independent random variables, U be uniform on the interval (0, 2π) and W be exponential with parameter 1. With help of (1.12), show that Z1 = √(2W) cos(U) and Z2 = √(2W) sin(U) are independent standard normal.
24. Let X1 , X2 be independent standard normal random variables. With help of (1.12), find the joint density function of Y1 = X1 and Y2 = X2/X1. Conclude from here that X2/X1 has Cauchy distribution.

25. Verify (2.3). Hint: You have to show that


∫_{−∞}^{∞} e^{−x^2/2} dx = √(2π)
To this end, consider a double integral
∫_{−∞}^{∞} ∫_{−∞}^{∞} e^{−(x^2+y^2)/2} dx dy

and evaluate it by changing to polar coordinates.


26. Let Z be a standard normal random variable, and let n > 0 be an integer.
Verify that
EZ 2n−1 = 0
and
EZ 2n = (2n − 1)!!
where (2n − 1)!! = 1 · 3 · 5 · · · · · (2n − 1). Hint: use (2.3), induction and integration
by parts.
27. Let Z be standard normal random variable. Compute Cov(Z, Z 2 ).
28. Let X be a normal random variable with parameters m, σ 2 . Verify that
EX = m and Var X = σ 2 .
29. Let Z be a standard normal random variable. (a) Compute the probability F(a) = P(Z^2 ≤ a) in terms of the distribution function of Z. (b) Differentiating in a, show that Z^2 has Gamma distribution with parameters α = 1/2 and θ = 2.
30. Integrating by parts, verify that
Γ(α + 1) = αΓ(α)
for all α > 0. Conclude from here that Γ(n + 1) = n! for all integers n = 0, 1, 2, . . .
(first, find Γ(1), then use induction).
31. Let X have Gamma distribution with parameters α, θ. Using the result of
Problem 30, verify that EX = αθ and Var X = αθ2 . [Hint: think of a substitution
that converts the integral to the one similar to (2.4).]
32. Verify that, if X, Y are independent Gamma distributed random variables
with parameters α, θ and β, θ, respectively, then X + Y has Gamma distribution
with parameters α + β and θ (use formula (1.8)).
33. (a) Verify that the moment generating function of a normal random vari-
able with parameters m, σ 2 is given by the formula
φ(t) = e^{mt + σ^2 t^2/2}

(b) Show that, if X and Y are independent normal random variables with param-
eters m1 , σ12 and m2 , σ22 , respectively, then X + Y is also normal with parameters
m1 + m2 and σ12 + σ22 (use part (a) and the properties of moment generating func-
tions).

34*. Let (X, Y) have a bivariate normal distribution with the density given by the formula (2.8).
(a) Assuming that the identity (2.3) (see Problem 25 above) is known, show that the marginal distribution of X is N(mX, σX^2).
(b) Verify that ρ is the correlation coefficient of X and Y. Conclude from here (and from part (a)) that, for a bivariate normal distribution, X and Y are independent if and only if they are uncorrelated.
(c) Using the result in part (a), show that the conditional distribution of Y given X = x is normal with parameters mY + ρ(σY/σX)(x − mX) and σY^2(1 − ρ^2). In particular, the conditional expectation
E(Y|X) = mY + ρ(σY/σX)(X − mX)
is linear and the conditional variance Var(Y|X) = σY^2(1 − ρ^2) is a constant (does not depend on X).
(d) Using part (c), verify that the relation E(E(Y|X)) = EY, as well as (1.7), holds in this case.
35. Show that Γ(1/2) = √π. Hint: Write down the integral (2.4) with α = 1/2 and try the substitution u = √(2x). Compare the result with (2.3).
36. Show that if X and Y are identically distributed, not necessarily indepen-
dent, then
Cov(X + Y, X − Y ) = 0
37. Let X, Y be independent random variables, and let X be χ2 (n). Suppose
X + Y is χ2 (n + m). Show that Y is χ2 (m). [Hint: Express φX+Y in terms of φX
and φY and solve for φY . Compare the result with the moment generating function
of χ2 (m).]
38. Let X, Y be independent random variables, and let X be N (m1 , σ12 ).
Suppose X + Y is N (m1 + m2 , σ12 + σ22 ). Show that Y is N (m2 , σ22 ). [Hint: Express
φX+Y in terms of φX and φY and solve for φY . Compare the result with the
moment generating function of N (m2 , σ22 ).]
39*. Let X1 , . . . , Xn be i.i.d. random variables. Find
E(X1 |X1 + · · · + Xn = x)
40*. Let X1 , . . . , Xn be i.i.d. random variables that are strictly positive with
probability one. For all 0 ≤ k ≤ n, find
E[ (X1 + · · · + Xk) / (X1 + · · · + Xn) ]
41*. Let X be a continuous random variable with finite expectation. A number
m is called a median if P (X < m) = 1/2. Show that the median m solves the
minimization problem
E|X − m| → min .
APPENDIX B

Elements of Statistics

1. Basics
In probability, the starting point is a probabilistic model (e.g., independent random variables with a given distribution), and the objective of the study is to establish certain properties of the model. In contrast, in statistics, the starting point is a data set. The probabilistic model for the data is not known or may not even exist. A typical objective of the study is to find a reasonable model, estimate its parameters, test statistical hypotheses, predict future behavior, etc.
Most of the common statistical methods are not applicable to time series with-
out major modifications because they are designed for samples, that is for the data
collected in the situation when the same experiment is repeated independently. We
discuss some basic methods, concepts and results.
Parametric and non-parametric approach. To some extent, statistical
methods could be divided into parametric and non-parametric ones. For paramet-
ric methods, we assume that the joint distribution of the data X1 , . . . , Xn is known
to us up to a few unknown parameters. We therefore have to estimate those pa-
rameters, to test statistical hypotheses about the values of the parameters etc. The
following rule of thumb limits the complexity of the model: you should have at
least ten data points per parameter.
For non-parametric methods, we assume that the distribution belongs to a
certain class, e.g. has certain smoothness and so on. If our goal is to estimate the
distribution, then this approach requires bigger data sets; on the other hand, if
you have thousands of observations, then any parametric model would typically be
rejected (real world can’t be easily parameterized).
Looking ahead, some of the methods of time series analysis are non-parametric,
for instance spectrum estimation. Others are parametric, for instance ARMA mod-
els.
Here we focus on parametric methods.
B1.1. Example. Estimation of mean and variance. We begin with
the following problem. Suppose X1 , . . . , Xn are independent identically distributed
random variables with unknown expectation µ = EXi and unknown variance σ 2 =
Var Xi . We want to estimate µ and σ 2 , that is to construct functions µ̂(X1 , . . . , Xn )
and σ̂ 2 (X1 , . . . , Xn ) (estimates) that are ‘close’ to µ and σ 2 and therefore serve as
an ‘educated guess’ about the actual values of the parameters.
For instance, since EXi = µ, we may take the sample mean
X̄ = (X1 + · · · + Xn)/n
as an estimate for µ. Clearly, E X̄ = µ. In addition, the law of large numbers
implies that X̄ converges to µ in probability. For those reasons, X̄ is a natural
estimate for the expectation. In order to estimate the variance, we may consider
σ̂² = (1/n) Σ_{i=1}^{n} (Xi − X̄)²
Direct computation shows that
(1.1)  σ̂² = (1/n) Σ_{i=1}^{n} Xi² − (X̄)²

Hence, by the law of large numbers applied to Xi and Xi2 , σ̂ 2 converges in proba-
bility to EX 2 − (EX)2 = σ 2 . Next, EXi2 = σ 2 + µ2 and E X̄ 2 = Var X̄ + (E X̄)2 =
σ 2 /n + µ2 . Therefore
E σ̂² = σ² + µ² − σ²/n − µ² = ((n − 1)/n) σ².
For this reason, another statistic is often considered,
(1.2)  s² = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)² = (n/(n − 1)) σ̂²

which still converges to σ 2 in probability and has the property


Es2 = σ 2 .
The statistic s² is called the sample variance.
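A quick numerical illustration (a Python sketch; numpy is assumed to be available, and the particular distribution, sample size and number of replications are arbitrary choices) of the bias of σ̂² and the unbiasedness of s²:

    import numpy as np

    rng = np.random.default_rng(0)
    n, true_var, reps = 10, 4.0, 100_000

    sigma_hat_sq = np.empty(reps)   # divides by n
    s_sq = np.empty(reps)           # divides by n - 1 (sample variance)
    for r in range(reps):
        x = rng.normal(loc=1.0, scale=np.sqrt(true_var), size=n)
        centered = x - x.mean()
        sigma_hat_sq[r] = np.sum(centered ** 2) / n
        s_sq[r] = np.sum(centered ** 2) / (n - 1)

    print(sigma_hat_sq.mean())      # close to (n - 1)/n * 4 = 3.6
    print(s_sq.mean())              # close to the true variance 4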
B1.2. Classification of estimates. Unbiasedness and Consistency
We say that an estimate θ̂ is an unbiased estimate for θ if
E θ̂ = θ
Otherwise, it is biased, and the difference b(θ) = E θ̂ − θ is called the bias.
We say that it is asymptotically unbiased if
E θ̂ → θ
as n → ∞. The meaning of the property is clear: on average, our guess is unbiased
or nearly unbiased if the number of observations is large. In the above
example, sample mean X̄ and sample variance s2 are unbiased estimates for µ and
σ 2 . The estimate σ̂ 2 is asymptotically unbiased.
An estimate θ̂ is called consistent if
θ̂ → θ in probability,
that is, if, for every ε > 0, P (|θ̂ − θ| > ε) → 0 as n → ∞. So if the number of the
observations is large, then our guess is close to the actual value of the parameter
with high probability. Consistent estimates are, typically, asymptotically unbiased
(we need to pass to the limit under the sign of the expectation for that). In our
example, all three estimates are consistent because of the weak law of large numbers
(we have to apply it twice: to the random variables Xi in order to show the consistency
of the sample mean, and to Xi² in order to show the consistency of σ̂² and of the
sample variance).
Criterion of consistency. One can show that, if an estimate θ̂ is asymptoti-
cally unbiased and Var θ̂ → 0 as n → ∞, then θ̂ is consistent. Indeed,
E(θ̂ − θ)2 = Var(θ̂ − θ) + (E(θ̂ − θ))2 = Var θ̂ + (E(θ̂ − θ))2 → 0
as n → ∞ and therefore
P{|θ̂ − θ| > ε} ≤ (1/ε²) E(θ̂ − θ)² → 0
as well.
B1.3. Cramér-Rao Inequality. Fisher information. Suppose X1 , . . . , Xn
are i.i.d. with density f (x, θ), and let θ̂ be an unbiased estimate for parameter θ.
In addition, assume that the density f (x, θ) is twice differentiable with respect to
θ. Also, we assume that the set {x : f (x, θ) > 0} does not depend on θ. Then, for
every unbiased estimate θ̂,
(1.3)  Var(θ̂) ≥ 1/(nI(θ))
where
(1.4)  I(θ) = E[(∂ log f(Xi, θ)/∂θ)²]
is the Fisher information. The inequality (1.3) is known as the Cramér-Rao inequal-
ity. It therefore gives a lower bound for the variance of every unbiased estimate. A
sketch of the proof is given below in the ‘Details’ subsection.
If we have even more smoothness, then the formula (1.4) can be further reduced
to
(1.5)  I(θ) = −E[∂² log f(Xi, θ)/∂θ²]
In some cases, the expression (1.5) is more convenient.
If there is not one but several unknown parameters θ1 , . . . , θk , then there exists
a multivariate version of Cramér-Rao inequality which gives a lower bound for the
variance-covariance matrix for θ̂1 , . . . , θ̂k provided those estimates are unbiased.
B1.4. Efficiency. Since the Cramér-Rao inequality gives a lower bound for
the variance of an estimate, it allows us to define an efficiency of an (unbiased)
estimate. Let θ̂ be an unbiased estimate for the parameter θ. We set
e(θ̂) = (nI(θ))⁻¹ / E(θ̂ − θ)²
where I(θ) is the Fisher information. By (1.3), 0 ≤ e(θ̂) ≤ 1 for unbiased estimates.
The efficiency (as well as the estimate itself) depends on the number of observations.
If the variance of θ̂ is exactly equal to 1/(nI(θ)) and therefore e(θ̂) = 1, then
θ̂ is called efficient, or minimal variance unbiased estimate. Whenever minimal
variance unbiased estimators exist, they could be found by the method of maximum
likelihood.
If
e(θ̂) → 1
as n → ∞, then the estimate is called asymptotically efficient.
For instance, suppose that Xi are i.i.d. normal with parameters µ, σ 2 and only
µ is unknown. As we have already seen, X̄ is a natural estimate for µ, it is unbiased
and consistent. Moreover, it could be shown that,
E(X̄ − µ)2 = 1/(nI(µ))
and therefore the efficiency of X̄ equals one. If Xi are normal but σ² is not known,
then X̄ is only asymptotically efficient; if Xi are not normal, then the efficiency fails.
For instance, if Xi have the so-called double exponential distribution with density
f(x, µ, θ) = (1/(2θ)) exp{−|x − µ|/θ}, a.k.a. the Laplace distribution, then the efficiency of
X̄ is 1/2; the best estimate here is the sample median.
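A Monte Carlo sketch of this comparison in Python (numpy assumed; the sample size, the scale θ and the number of replications are arbitrary illustrative choices). For Laplace data, the mean squared error of the sample median comes out close to the Cramér-Rao bound θ²/n, while that of the sample mean is about twice as large:

    import numpy as np

    rng = np.random.default_rng(1)
    n, mu, theta, reps = 101, 0.0, 1.0, 50_000

    mean_est = np.empty(reps)
    median_est = np.empty(reps)
    for r in range(reps):
        x = rng.laplace(loc=mu, scale=theta, size=n)   # double exponential sample
        mean_est[r] = x.mean()
        median_est[r] = np.median(x)

    print(np.mean((mean_est - mu) ** 2))     # about 2*theta^2/n
    print(np.mean((median_est - mu) ** 2))   # about theta^2/n, the Cramer-Rao bound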
B1.5. Asymptotic Normality. The last ‘good’ property we’d like to have
is called asymptotic normality. We say that an estimate θ̂ is asymptotically normal
if
θ̂ − θ ≈ N (0, σn2 )
as n → ∞. This property allows us to test statistical hypotheses about the values of
the parameter in case σn² is known to us or can be estimated. For instance,
let X̄ be the sample mean. The Central Limit Theorem implies that X̄ ≈ N(θ, σ²/n),
so the sample mean is asymptotically normal, and its variance is either known
or can be estimated. As for the sample variance, the χ² distribution provides a better
approximation (though it is asymptotically normal as well).
There exists a number of methods of constructing ‘good’ estimates; we’ll discuss
only one.
B1.6. Maximum Likelihood estimates. Suppose Xi are i.i.d. random
variables with density f (x; θ) where θ stands for the unknown parameter (one or
several). Then the joint density of X1 , . . . , Xn is equal to the product
f (x1 , . . . , xn ; θ) = f (x1 ; θ) . . . f (xn ; θ)
Let us plug in the data X1 , . . . Xn instead of x1 , . . . , xn :
L(θ) = f (X1 , . . . , Xn ; θ)
The resulting function is called the likelihood function and it is the function of the
unknown parameter(s) only. In order to find the maximum likelihood estimate of
θ, we maximize the likelihood function L. The point θ̂ at which L achieves its
maximum, is called the maximum likelihood estimate for θ.
Since the joint density f (x1 , . . . , xn ; θ) of X1 , . . . , Xn is a product of the mar-
ginal densities, it is more convenient to deal with its logarithm. The function
log L(θ)
is called the log likelihood function. Clearly, L and log L achieve the maximum at
the same point, but log L is, typically, easier to work with.
The method of maximum likelihood is a formalization of the following idea. If
Xi are i.i.d. with density f(x), then most of the observations Xi should fall in
the region where the density is large. So we choose the values of the unknown
parameters for which most of the values f(Xi; θ) are large and none are very
small.
Under some regularity conditions imposed on the density f (x; θ) (similar to
those required for the Cramér-Rao inequality), the maximum likelihood estimates
are consistent, asymptotically unbiased, asymptotically normal and asymptotically
efficient. Computations that support this statement, could be found at the end of
the section.
Example. Let X1 , . . . , Xn be a sample of size n from a normal distribution
with unknown expectation µ and known variance σ 2 . Then the likelihood function
equals
L(µ) = (1/((2π)^{n/2} σ^n)) exp{−Σ(Xi − µ)²/(2σ²)}.
Since everything except µ is a constant, L reaches its maximum when Σ(Xi − µ)²
is smallest. Equivalently, we could consider the log likelihood function
log L(µ) = −(n/2) log(2π) − n log σ − Σ(Xi − µ)²/(2σ²)
and set the derivative with respect to µ to zero. Either way, we get the equation
Σ(Xi − µ) = 0,
which yields µ̂ = X̄. As we already know, this estimate is unbiased and consistent.
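The same estimate can be recovered by maximizing the log likelihood numerically. The following Python sketch (numpy and scipy assumed; the data are synthetic and the known σ is an arbitrary choice) checks that the numerical maximizer agrees with X̄:

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(2)
    sigma = 2.0                                   # known standard deviation
    x = rng.normal(loc=5.0, scale=sigma, size=200)

    def neg_log_likelihood(mu):
        # minus log L(mu); terms that do not depend on mu are dropped
        return np.sum((x - mu) ** 2) / (2 * sigma ** 2)

    result = minimize_scalar(neg_log_likelihood)
    print(result.x, x.mean())                     # the two values coincide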
Remark. In fact, the method of maximum likelihood does not require the
observations to be i.i.d. Whenever the joint density of X1 , . . . , Xn is known up
to some unknown parameters, the same procedure can be applied. Properties of
the estimates have been studied in some, but not all, cases that are not i.i.d. No
exceptions are known to the following rule of thumb: Whenever consistent esti-
mates exist at all, maximum likelihood estimates are consistent and asymptotically
efficient. Usually, they are also asymptotically normal.
B1.7. Confidence intervals. Instead of constructing an estimate µ̂ for the
unknown parameter µ, we may prefer to construct an interval [µ̂1 , µ̂2 ] such that it
contains the actual value of the parameter with large probability. Such an interval
is called a confidence interval for the parameter µ. The number
α = P {µ̂1 ≤ µ ≤ µ̂2 }
is called the level of significance of the interval. For instance, we say that it is a
95% confidence interval if
P {µ̂1 ≤ µ ≤ µ̂2 } = 0.95
Confidence intervals are more preferable than point estimates if the sample size is
small. They sometimes could be constructed from the point estimates if we know
the distribution of the estimate.
Example. Suppose X1 , . . . , Xn are i.i.d. N (µ, 1) and
µ̂ = X̄ = (X1 + · · · + Xn )/n
Then µ̂ − µ is also normal with expectation 0 and variance 1/n. Therefore
Z = √n (µ̂ − µ)
is standard normal, so that
P {−1.96 ≤ Z ≤ 1.96} = 0.95
(we check the table of the standard normal distribution for that). Solving for µ, we
see that
{−1.96 ≤ Z ≤ 1.96} = {µ̂ − 1.96/√n ≤ µ ≤ µ̂ + 1.96/√n}
So,
[µ̂ − 1.96/√n, µ̂ + 1.96/√n]
is a 95% confidence interval for µ.
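A short Python sketch of this computation (numpy assumed; the true µ and the sample size are arbitrary illustrative values):

    import numpy as np

    rng = np.random.default_rng(3)
    mu_true, n = 2.0, 50
    x = rng.normal(loc=mu_true, scale=1.0, size=n)   # i.i.d. N(mu, 1) data

    mu_hat = x.mean()
    half_width = 1.96 / np.sqrt(n)
    print(mu_hat - half_width, mu_hat + half_width)  # a 95% confidence interval for mu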
B1.8. Testing of statistical hypotheses. To outline the main concepts, let
us discuss a simple example. Suppose X1 , . . . , Xn are i.i.d. and we would like to
test a hypothesis
H0 : X1 , . . . , Xn are standard normal (N (0, 1))
versus
H1 : X1 , . . . , Xn are N (m, 1) where m > 0 is an unknown parameter.
The hypothesis H0 determines the joint distribution of X1 , . . . , Xn completely.
It is an example of a simple hypothesis about the value of a parameter (in our
case, it is the expectation). The hypothesis H1 determines the distribution up to
unknown parameter m. That is an example of a composite hypothesis.
We are going to use the following decision rule: compute X̄ = (X1 +· · ·+Xn )/n
and choose some number c > 0. If X̄ < c, we choose H0 , otherwise we choose H1 .
Generalizing, we formulate two statistical hypotheses H0 and H1 , each of them
may be simple (that is, it determines the joint distribution of the observations
uniquely, like H0 in our example) or be composite (the distribution belongs to a
certain class, like H1 in our example). To construct the decision rule, we choose a
critical region, that is, a domain C ⊂ Rn . If the vector (X1 , . . . , Xn ) belongs to C,
then we choose H0 , otherwise we reject H0 and choose H1 . The choice of the critical
region is the hardest part of the procedure. Typically, we first choose a critical
statistics γ(X1 , . . . , Xn ) and then set C = {(x1 , . . . , xn ) : γ(x1 , . . . , xn ) ≤ γ0 }. So,
if γ(X1 , . . . , Xn ) ≤ γ0 , we choose H0 , otherwise we choose H1 . In our example, the
critical statistics was the sample mean X̄.
Since we have a finite data set, our decision may be wrong. If we reject H0
when it is actually valid, we commit type 1 error. The corresponding probability
α = P {reject H0 |H0 }
is called the level of significance of the test. It clearly depends on the parameters
of the test (in our example, the sample size n and the number c).
If we accept H0 when H1 is valid, we commit type 2 error. The probability
κ = 1 − β = P {reject H0 |H1 } = 1 − P {accept H0 |H1 }
is called the power of the test (and β is the probability of the type 2 error). If H1
is not a simple hypothesis, then the power depends on the actual distribution. In
our example, the power of the test depends on the expectation m (and, of course,
on n and c).
Computation of the level of significance and of the power of the test requires
the knowledge about the distribution of the critical statistics γ under both H0
and H1 . In our example, X̄ is normal with the same expectation as the original
observations, and its variance is equal to 1/n. So, the computation of α and κ is
not a problem (do this!). Other distributions that show up in testing of statistical
hypotheses include the χ² distribution, the t-distribution, the F distribution and some others.
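For the decision rule 'reject H0 when X̄ ≥ c', the computation suggested above amounts to evaluating the standard normal distribution function. A Python sketch (scipy assumed; n, c and the particular alternative m are arbitrary illustrative choices):

    import numpy as np
    from scipy.stats import norm

    n, c = 25, 0.4                           # sample size and critical value
    m = 0.7                                  # a particular alternative from H1

    # Under H0, Xbar ~ N(0, 1/n); under H1 with this m, Xbar ~ N(m, 1/n)
    alpha = norm.sf(c * np.sqrt(n))          # P(Xbar >= c | H0): level of significance
    power = norm.sf((c - m) * np.sqrt(n))    # P(Xbar >= c | H1): power at this m
    print(alpha, power)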
B1.9. Details. 1. Fisher information and Cramér-Rao inequality
(sketch)*. First, we establish (1.5). Let X1 , . . . , Xn be a sample from a distribu-
tion with the density f(x; θ). We have
(1.6)  E[(1/f(Xi; θ)) ∂f(Xi; θ)/∂θ] = ∫_{−∞}^{∞} ∂f(x; θ)/∂θ dx = (∂/∂θ) ∫_{−∞}^{∞} f(x; θ) dx = (∂/∂θ) 1 = 0.
To justify that, we need some mild conditions that would allow us to swap the
integral and the partial derivative. In a similar way, we can show that
(1.7)  E[(1/f(Xi; θ)) ∂²f(Xi; θ)/∂θ²] = (∂²/∂θ²) ∫_{−∞}^{∞} f(x; θ) dx = 0.
The equation (1.6) is equivalent to the identity
E[∂ log f(Xi; θ)/∂θ] = 0
In turn, one can show that (1.7) implies
E[∂² log f(Xi; θ)/∂θ²] = −I(θ).
Indeed,
∂² log f(Xi; θ)/∂θ² = (1/f(Xi; θ)) ∂²f(Xi; θ)/∂θ² − ((∂f(Xi; θ)/∂θ) / f(Xi; θ))²
by the chain rule. The expectation of the first term equals zero by (1.7), and the
second term is equal to
−(∂ log f(Xi; θ)/∂θ)²,
so its expectation coincides with −I(θ) by (1.4).
We now move to the proof of Cramér-Rao inequality. It is based on the Schwartz
inequality (A1.3). Denote
B = Σ_{i=1}^{n} ∂ log f(Xi; θ)/∂θ.
By (1.6),
E(B) = 0.
Next, B is a sum of independent identically distributed random variables with zero
expectation and therefore
Var(B) = Σ_{i=1}^{n} Var(∂ log f(Xi; θ)/∂θ) = n E[(∂ log f(X1; θ)/∂θ)²] = nI(θ)

Let θ̂ = θ̂(X1 , . . . , Xn ) be an unbiased estimate of parameter θ. We want to compute


Cov(B, θ̂). Since E(B) = 0, it coincides with E(B θ̂). Also,

B = ∂ log(f(X1; θ) · · · f(Xn; θ))/∂θ = [∂/∂θ (f(X1; θ) · · · f(Xn; θ))] / [f(X1; θ) · · · f(Xn; θ)]
and therefore
E(B θ̂) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} θ̂(x1, . . . , xn) (∂/∂θ)(f(x1; θ) · · · f(xn; θ)) dx1 . . . dxn
       = (∂/∂θ) ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} θ̂(x1, . . . , xn) f(x1; θ) · · · f(xn; θ) dx1 . . . dxn
       = (∂/∂θ) E(θ̂) = (∂/∂θ) θ = 1.
Therefore
1 ≤ Var(B) Var(θ̂) = nI(θ) Var(θ̂)
which is equivalent to (1.3).


2. Properties of the Maximum Likelihood Estimates (sketch)*. Sup-
pose X1 , . . . , Xn are i.i.d. random variables with density f (x, θ). Let θ0 be the
(unknown) actual value of the parameter. The maximum likelihood estimate θ̂
satisfies the equation
(1.8)  ∂ log L(θ)/∂θ |_{θ=θ̂} = 0.
Since
log L(θ) = Σ_{i=1}^{n} log f(Xi; θ)
the equation (1.8) can be written as follows
(1.9)  0 = (1/n) Σ_{i=1}^{n} ∂ log f(Xi; θ)/∂θ |_{θ=θ̂}

(we plugged in the expression for log L and divided by n). Now, let us expand the
function ∂ log f(x; θ)/∂θ around θ0. We have
∂ log f(x; θ)/∂θ = ∂ log f(x; θ)/∂θ |_{θ=θ0} + (θ − θ0) ∂² log f(x; θ)/∂θ² |_{θ=θ0} + O((θ − θ0)²)
Substituting this expression into (1.9) and denoting
B0 = (1/n) Σ_{i=1}^{n} ∂ log f(Xi; θ)/∂θ |_{θ=θ0},
B1 = (1/n) Σ_{i=1}^{n} ∂² log f(Xi; θ)/∂θ² |_{θ=θ0},

we get the following formula


(1.10) 0 = B0 + (θ̂ − θ0 )B1 + O((θ̂ − θ0 )2 )

Now, B0 is an average of the i.i.d. random variables ∂ log f(Xi; θ)/∂θ |_{θ=θ0}. By (1.6),
they have zero expectation. Therefore B0 converges to zero by the law of large
numbers. In a similar way, B1 converges to the expectation
E[∂² log f(Xi; θ)/∂θ² |_{θ=θ0}] = −I(θ0)

(see (1.7) above). Since B0 goes to zero and B1 does not, there exists a solution θ̂
to (1.10) that is close to θ0 and we can conclude from here that
θ̂ − θ0 ≈ −B0/B1
Next, since B0 → 0 and B1 converges to −I(θ0), one can conclude that E θ̂ → θ0,
so θ̂ is asymptotically unbiased. Moreover, the variance of θ̂ approximately equals
Var θ̂ ≈ (1/I²(θ0)) Var B0 ≈ (1/I²(θ0)) (1/n) Var(∂ log f(Xi; θ)/∂θ |_{θ=θ0}) = 1/(nI(θ0))
So,
E(θ̂ − θ0)² ≈ 1/(nI(θ0))
which makes θ̂ asymptotically efficient because of the Cramér-Rao inequality. Also,


B0 is approximately normal because of the Central Limit Theorem. Finally,
nI(θ0) ≈ −nB1 = −∂²/∂θ² log L(θ) |_{θ=θ0} ≈ −∂²/∂θ² log L(θ) |_{θ=θ̂}.
and therefore
Var(θ̂) ≈ −1 / (∂²/∂θ² log L(θ) |_{θ=θ̂})
In the case of several unknown parameters, their variance-covariance matrix can
be estimated as the negative of the inverse of the matrix of second partial derivatives
of the log likelihood function.

2. Linear Regression and Least Squares


B2.1. Simple linear regression. Suppose the data set consists of n pairs
of observations (X1 , Y1 ), . . . , (Xn , Yn ) and we have some reason to believe that Y
(dependent variable) depends on X (independent variable) (so this is kind of an
input—output system). The Xs are not random (they are given to us or somehow
chosen by somebody). We want the best possible approximation of Y s by a function
of Xs.
A natural first option is to construct a linear approximation
Yi ≈ aXi + b
Depending on a, b, we have approximation errors εi = Yi −aXi −b. We would like to
make them “small” by appropriate choice of a and b. The simplest computational
procedure appears if we minimize the sum of the squares of the errors, that is
minimize
(2.1)  Q(a, b) = Σ_{i=1}^{n} (Yi − aXi − b)²
This procedure is called the method of least squares, and the corresponding esti-
mates are called least squares, or LS, estimates.
Computation of â and b̂ is really easy. Indeed, differentiating with respect to a
and b, we get two equations, so called normal equations:
(2.2)  ∂Q/∂a = Σ_{i=1}^{n} (−2Xi(Yi − aXi − b)) = 0,
       ∂Q/∂b = Σ_{i=1}^{n} (−2(Yi − aXi − b)) = 0
To solve them, denote
X̄ = (Σ_{i=1}^{n} Xi)/n,    Ȳ = (Σ_{i=1}^{n} Yi)/n,
\overline{XY} = (Σ_{i=1}^{n} Xi Yi)/n,    \overline{X²} = (Σ_{i=1}^{n} Xi²)/n
Then the equations (2.2) can be rewritten as
(2.3)  \overline{XY} = a \overline{X²} + b X̄,
       Ȳ = a X̄ + b
and therefore
â = (\overline{XY} − X̄ · Ȳ) / (\overline{X²} − X̄²),    b̂ = Ȳ − â X̄
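A Python sketch of these formulas on synthetic data (numpy assumed; the true slope, intercept and noise level are arbitrary), with numpy's own least squares fit as a cross-check:

    import numpy as np

    rng = np.random.default_rng(4)
    x = np.linspace(0.0, 10.0, 40)                           # non-random X's
    y = 1.5 * x + 3.0 + rng.normal(scale=0.5, size=x.size)   # Y = aX + b + noise

    a_hat = (np.mean(x * y) - x.mean() * y.mean()) / (np.mean(x ** 2) - x.mean() ** 2)
    b_hat = y.mean() - a_hat * x.mean()
    print(a_hat, b_hat)
    print(np.polyfit(x, y, 1))   # least squares line fitted by numpy: same a and b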
B2.2. Probabilistic Motivation (Normal Linear Regression Model).
Suppose random variables Yi are independent and normally distributed with ex-
pectations aXi + b and with the same variance σ 2 where a, b and σ 2 are unknown
parameters and Xi are known (non-random) numbers. Let us construct the max-
imum likelihood estimates for the parameters a, b, σ 2 . The density of the random
variable Yi is given by the formula
f_{Yi}(y) = (1/(√(2π) σ)) exp{−(y − aXi − b)²/(2σ²)}
Since Yi are independent, the joint density of Y1 , . . . , Yn is equal to the product of
the marginal ones:
F_{Y1,...,Yn}(y1, . . . , yn; a, b, σ²) = (1/((2π)^{n/2} σ^n)) exp{−Σ_{i=1}^{n} (yi − aXi − b)²/(2σ²)}
Therefore the log likelihood function is equal to
log L(a, b, σ²) = log F(Y1, . . . , Yn) = −(n/2) log(2π) − n log σ − Σ_{i=1}^{n} (Yi − aXi − b)²/(2σ²)
                = −(n/2) log(2π) − n log σ − Q(a, b)/(2σ²)
where Q is given by (2.1). For every σ, the maximal value of log L corresponds to
the minimal value of Q and therefore the maximum likelihood estimates of a and
b coincide with the least squares estimates found above. However, we are able to
estimate σ 2 as well. Namely, let â, b̂ be the least squares estimates. Differentiating
with respect to σ, we get the equation
∂ log L/∂σ = −n/σ + Q(â, b̂)/σ³ = 0
which implies
σ̂² = Q(â, b̂)/n.
Note that this estimate is biased (but asymptotically unbiased, consistent and
asymptotically efficient). In order to get an unbiased estimate, we should divide by
n − 2 instead:
s_e² = Q(â, b̂)/(n − 2)
The estimate â also admits a probabilistic interpretation. Since \overline{X²} − X̄²
and \overline{XY} − X̄ · Ȳ are estimates of the variance of X and of the covariance
CXY of X and Y, â can be written as
â = ĈXY / σ̂X²
B2.3. Multiple linear regression. If Y depends on several factors X1, . . . , Xk,
we arrive at multiple linear regression. For instance, assume that the random variables
Yi are independent and normally distributed with expectations a0 + a1 Xi1 + a2 Xi2 +
· · · + ak Xik and the same variance σ². Here Xi1, . . . , Xik are certain external factors
for the observation i (their values may change when we move to the next data
point). Once again, maximum likelihood estimates for a0 , . . . , ak could be found by


least squares. We compute the sum of the squares of the residuals as a function of
the unknown coefficients and minimize it. Namely, we set
Q(a0, a1, . . . , ak) = Σ_{i=1}^{n} (Yi − a0 − a1 Xi1 − a2 Xi2 − · · · − ak Xik)²

In order to minimize Q, we set to zero partial derivatives of Q with respect to


a0 , a1 , . . . , ak . Once again, we get a system of linear equations (so called normal
equations). This time, however, it is much easier to write them down in matrix
notation. To this end, we have to consider three matrices, one of them contains all
Xs, second is a vector of Y s and the last one is the vector of coefficients. Namely,
we set
X = [ 1  X11  X12  . . .  X1k
      1  X21  X22  . . .  X2k
      . . .
      1  Xn1  Xn2  . . .  Xnk ],    Y = (Y1, Y2, . . . , Yn)^T,    A = (a0, a1, . . . , ak)^T
In terms of those matrices,
Q(a0, . . . , ak) = Q(A) = (Y − XA)^T (Y − XA)
and the vector Â that minimizes Q(A) equals
Â = (X^T X)^{-1} X^T Y
Moreover, an unbiased estimate for σ² can be found as s_e² = Q(Â)/(n − k − 1)
(recall that k + 1 is exactly the number of unknown coefficients in the model).
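A numpy sketch of these matrix formulas on synthetic data (the coefficient values are arbitrary). Solving the normal equations with a linear solver is numerically preferable to forming the inverse (X^T X)^{-1} explicitly:

    import numpy as np

    rng = np.random.default_rng(5)
    n, k = 60, 2
    factors = rng.normal(size=(n, k))            # the external factors X_i1, ..., X_ik
    y = 1.0 + 2.0 * factors[:, 0] - 3.0 * factors[:, 1] + rng.normal(scale=0.4, size=n)

    X = np.column_stack([np.ones(n), factors])   # design matrix with a column of ones
    A_hat = np.linalg.solve(X.T @ X, X.T @ y)    # solves the normal equations
    residuals = y - X @ A_hat
    s2_e = residuals @ residuals / (n - k - 1)   # unbiased estimate of sigma^2
    print(A_hat, s2_e)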

Exercises
1. Let X1 , . . . , Xn be a sample of size n from a distribution with expectation µ
and variance σ 2 and let µ̂ = (2X1 + X2 + · · · + Xn−1 + 2Xn )/(n + 1) be an estimator
for µ. Is it unbiased? asymptotically unbiased? consistent?
2. Let X1 , . . . , Xn be a sample of size n from a normal distribution with
parameters µ, σ 2 . Assuming σ 2 is known, show that the sample mean X̄ is a
minimal variance unbiased estimate for µ. To this end, compute the corresponding
Fisher information and compare the variance of the sample mean with the Cramér-
Rao bound.
3. Let X1 , . . . , Xn be a sample of size n from a normal distribution with
parameters µ, σ 2 . Assuming µ is known, construct a maximum likelihood estimate
for σ 2 . Is it unbiased? Asymptotically unbiased? Consistent?
4. Let X1 , . . . , Xn be a sample of size n from a normal distribution with
parameters µ, σ 2 . Construct maximum likelihood estimates for µ and σ 2 . Are they
unbiased? Asymptotically unbiased? Consistent?
5. Let X1 , . . . , Xn be a sample of size n from a Gamma distribution with
parameters α and θ (see Section A2.2). Assuming α is known, construct a maximum
likelihood estimate for θ.
6. Let X1 , . . . , Xn be a sample of size n from a distribution with the density
f(x, α, β) = αβ x^{β−1} e^{−αx^β}  if x > 0,  and f(x, α, β) = 0 otherwise
where α > 0 and β > 0 (so called Weibull distribution). Assuming β is known, find
a maximum likelihood estimate for α.
7. Let X1 , . . . , Xn be a sample of size n from a distribution with the density
f(x, α) = α/x^{α+1}  if x > 1,  and f(x, α) = 0 otherwise
where α > 0 (so called Pareto distribution). Find a maximum likelihood estimate
for α.
8. Let X1 , . . . , Xn be a sample from a normal distribution with parameters
mX , σ² and let Y1 , . . . , Yn be another sample from a normal distribution with pa-
rameters mY , σ 2 (variance is the same for both samples, samples are independent
from each other). Find the maximum likelihood estimates for mX , mY and σ 2 .
9*. Let X1 , . . . , Xn be a sample of size n from a distribution with the density
f(x, µ, θ) = (1/(2θ)) exp{−|x − µ|/θ}
where θ > 0 (Laplace distribution). Find maximum likelihood estimates for µ and
θ.
10. For a normal linear regression model, verify that â and b̂ are unbiased
estimates for a and b.
11. For a multiple linear regression model, verify that Q(Â) = Y^T Y − Â^T X^T Y.
APPENDIX C

Complex Variables Essentials. Difference equations

1. Basics
A complex number is an expression of the form a + ib where a, b are real num-
bers and i is the imaginary unit (an element with the property i2 = −1). Basic
operations with complex numbers are defined as follows:
(a1 + ib1 ) + (a2 + ib2 ) = (a1 + a2 ) + i(b1 + b2 )
(a1 + ib1 ) − (a2 + ib2 ) = (a1 − a2 ) + i(b1 − b2 )
(a1 + ib1 )(a2 + ib2 ) = (a1 a2 − b1 b2 ) + i(a1 b2 + a2 b1 )
We interpret a complex number z = a + bi as a point on the plane with rectangular
coordinates (a, b) (and as a corresponding vector as well). If z = a + bi, then
a = Re z is called the real part of z and b = Im z is the imaginary part of z.
Complex numbers of the form a = a + i0 are just real numbers. For this reason,
the horizontal axis is called the real axis. Numbers ib = 0 + ib are called purely
imaginary (and the vertical
axis is called the imaginary axis).
The distance |z| = √(a² + b²) from the origin to (a, b) is called the absolute value
of z. In particular, one can check that |z1 z2 | = |z1 ||z2 |.
For a complex number z = a + ib, a number z̄ = a − ib is called a complex
conjugate for z. Geometrically, z̄ is a reflection of z in the real axis. It could be
easily seen that
z + w = z̄ + w̄
zw = z̄ w̄
z z̄ = |z|2
The last relation allows us to compute a reciprocal
z⁻¹ = z̄/|z|²
It also helps us to divide complex numbers:
z/w = z w̄/(w w̄) = (1/|w|²) z w̄
Complex exponents. For real numbers x, y, we set
ex+iy = ex (cos y + i sin y)
There is a number of reasons to do so. One of them is a power series representation
e^z = 1 + z + z²/2! + . . .
Also,
ez1 +z2 = ez1 ez2
With help of complex exponents, an arbitrary complex number can be represented
as
(1.1) z = reiθ
where r = |z| is the absolute value of z and θ, called the argument of z, is defined
up to a multiple of 2π. The representation (1.1) is called the exponential form of a
complex number.
In terms of complex exponents, we have
cos x = (e^{ix} + e^{−ix})/2,    sin x = (e^{ix} − e^{−ix})/(2i)
for every real number x.
Geometrically, the condition |z| = 1 defines a unit circle. Its elements can be
represented in the form
z = cos θ + i sin θ = eiθ
where θ is the angle between the real axis and the vector z. Elements eiθ play a
special role. In particular,
e^{iω} e^{iθ} = e^{i(ω+θ)},    \overline{e^{iω}} = e^{−iω},    e^{iω}/e^{iθ} = e^{i(ω−θ)},    e^{i2πn} = 1
One of the fundamental results of the theory is as follows. Every polynomial of
the order n has n roots (counting the multiplicities). In particular, every polynomial
can be uniquely factorized into the product of linear functions:
p(z) = an z n + · · · + a0 = an (z − z1 ) . . . (z − zn )
if an ≠ 0. If the coefficients of the polynomial are real numbers, then p(z̄) equals the
complex conjugate of p(z), and therefore if p(z) = 0, then p(z̄) = 0. That is, if a root of
p(z) is not a real number, then its conjugate is also a root.
Exponential form and polar coordinates. If reiθ is the exponential form
of a complex number z = x + iy, then r and θ are polar coordinates of the point
with rectangular coordinates (x, y). Vice versa, if (r, θ) are the polar coordinates
of the point (x, y) and if r > 0, then r = |x + iy| is the absolute value of z = x + iy
and θ is the argument of z. The only difference between polar coordinates and
exponential representation of a complex number is that, in polar coordinates, r is
allowed to be negative.
Roots of unity. That is a name for the solutions of the equation
zn = 1
Suppose z = re^{iθ}. The equation becomes z^n = r^n e^{inθ} = 1 and therefore r = 1,
nθ = 2πk, so θ = 2πk/n. So, all the solutions to the equation have the form

wk = ei(2πk)/n , k = 0, 1, 2, . . . , n − 1.
In a similar way, we can solve an equation z n = z0 . Namely, if z0 = r0 eiθ0 , then all
the solutions to z n = z0 are given by the formula
z = (r0 )1/n eiθ0 /n wk , k = 0, 1, 2, . . . , n − 1
where wk are the roots of unity.
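Since Python has complex arithmetic built in, the roots of unity are easy to verify numerically; a short sketch (numpy assumed, n = 6 is an arbitrary choice):

    import numpy as np

    n = 6
    k = np.arange(n)
    w = np.exp(2j * np.pi * k / n)     # w_k = e^{i(2 pi k)/n}, k = 0, ..., n-1
    print(np.allclose(w ** n, 1.0))    # True: every w_k solves z^n = 1
    print(np.abs(w))                   # all absolute values equal 1 (the unit circle)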
Convergence of complex numbers. Convergence of a series. We say
that zn → z as n → ∞ if |zn − z| → 0. Since |Re z| ≤ |z| ≤ |Re z| + |Im z| and the
same is true for Im z, convergence of complex numbers is equivalent to convergence


of their real and imaginary parts.
We say that a series Σ_k z_k converges if the sequence of its partial sums Σ_{k=1}^{n} z_k
converges. Otherwise, we say that the series diverges. A series Σ_k z_k converges
absolutely if Σ_k |z_k| converges. An absolutely convergent series always converges.
However, it is possible that a series converges but not absolutely.

Exercises
1. Simplify
(1 + i)/(1 − i),    (1 + i)(1 − 2i)
2. Convert the following numbers to the exponential form:
−1 + i,    2i,    1 + i√3,    −0.5
3. Convert the following numbers to the usual (that is, a + bi) form:
e^{iπ},    (1/√2) e^{i3π/4},    e^{−iπ},    √2 e^{−iπ/4}
4. Which of the following numbers are inside the unit circle? Which are
outside? Which are on the unit circle?
1 + i,    −i,    (1/√2) e^{i3π/4},    0.6 + 0.8i,    e^{0.3i}
5. With help of trig identities, verify that, for real numbers x, y,
eix eiy = ei(x+y)
6. By multiplying and dividing by (1 − z), verify that
1 + z + z² + · · · + z^n = (1 − z^{n+1})/(1 − z)
if z ≠ 1.
7. Let θ = 2πk/n where k, n > 0 are some integers. Show that
eθi + e2θi + · · · + enθi = 0
if k is not a multiple of n, and
eθi + e2θi + · · · + enθi = n
otherwise. By taking real and imaginary parts, conclude that
cos θ + cos 2θ + · · · + cos nθ = 0,
sin θ + sin 2θ + · · · + sin nθ = 0
if k is not a multiple of n, and
cos θ + cos 2θ + · · · + cos nθ = n,
sin θ + sin 2θ + · · · + sin nθ = 0
otherwise.
8. Let θ = 2πk/n and ω = 2πl/n where n is even and 0 ≤ k, l ≤ n/2 are


integers. Using the trig identities and the results of the previous problem, show
that
Σ_{t=1}^{n} cos(θt) sin(ωt) = 0
Σ_{t=1}^{n} cos(θt) cos(ωt) = 0   if k ≠ l
Σ_{t=1}^{n} sin(θt) sin(ωt) = 0   if k ≠ l
Σ_{t=1}^{n} cos²(θt) = n/2   if k ≠ 0, n/2
Σ_{t=1}^{n} cos²(θt) = n   if k = 0 or n/2
Σ_{t=1}^{n} sin²(θt) = n/2   if k ≠ 0, n/2
Σ_{t=1}^{n} sin²(θt) = 0   if k = 0 or n/2

9. Let
p(z) = an z n + an−1 z n−1 + · · · + a1 z + a0
and let
q(z) = an + an−1 z + · · · + a1 z n−1 + a0 z n
Show that, if p(z) = 0, then q(z −1 ) = 0 and vice versa. If
p(z) = an (z − z1 ) . . . (z − zn )
(hence z1 , . . . , zn are the roots of p(z)), then
q(z) = an (1 − zz1 ) . . . (1 − zzn )

2. Power series and Analytic functions


Power series and their properties. Let c0 , c1 , . . . be a sequence of complex
numbers. A series
(2.1)  Σ_{k=0}^{∞} c_k z^k

is called a power series (to be precise, it is a power series about the origin, but
those are the only ones that we need). Important facts about power series:
1. There exists a number R, 0 ≤ R ≤ ∞, such that the power series (2.1)
converges absolutely for all z such that |z| < R, and diverges for all z such that
|z| > R. The number R is called the radius of convergence. Note that this property
does not say anything about those z with |z| = R.
2. It is possible to differentiate a power series termwise. Namely, the following


two series have the same radius of convergence:
Σ_{k=0}^{∞} c_k z^k,
Σ_{k=1}^{∞} k c_k z^{k−1}
Moreover, denote by S(z) the sum of the first of the series. Within the area of
convergence, the function S(z) is differentiable and its derivative S′(z) is equal to
the sum of the second series.
One of the most important (for us) power series is a geometric series
1 + z + z2 + . . .
It has the radius of convergence R = 1 (so it converges absolutely for all z such
that |z| < 1. It actually diverges if |z| ≥ 1 because |z k | = |z|k ≥ 1). Moreover, its
sum is equal to 1/(1 − z). Indeed,
1 + z + z² + · · · + z^k = (1 − z^{k+1})/(1 − z)
and we can pass to the limit here as k → ∞.
Differentiating the geometric series term by term, we get a representation
1/(1 − z)² = Σ_{k=1}^{∞} k z^{k−1}
and, after one more differentiation,
2/(1 − z)³ = Σ_{k=2}^{∞} k(k − 1) z^{k−2}
and so on. This allows us to evaluate the sums
Σ_{k=1}^{∞} k z^k = z/(1 − z)²,
Σ_{k=1}^{∞} k² z^k = (z + z²)/(1 − z)³,
and the like.
Yet another application of a geometric series is as follows. Let f(z) = 1/(z − z0)
where z0 ≠ 0. We have
(2.2)  f(z) = −(1/z0) · 1/(1 − z/z0) = −(1/z0)(1 + z/z0 + (z/z0)² + . . .)
and the power series on the right converges if and only if |z| < |z0|. Differentiating
this series term by term, we get power series representations for the functions
1/(z − z0)^k, k = 2, 3, . . . , with the same radius of convergence |z0|.
Rational functions and their power series representation. Rational
function is a quotient of two polynomials. Out of many properties of rational
functions, we need only one here. Namely, suppose f (z) = p(z)/q(z) where p and
q are polynomials, and suppose q(0) ≠ 0, so f (0) is well defined. To simplify the
statement, assume that p and q don’t have common roots (otherwise we can factor
them out and cancel). Let now R > 0 be the smallest of the absolute values of the
roots of the denominator q(z). Then f (z) can be represented as a sum of a power
series, with the radius of convergence that is exactly equal to R. To show that, we
should factor q(z) into the product of linear factors q(z) = bm (z − z1 ) . . . (z − zn )
and then represent f(z) as a linear combination of simple fractions 1/(z − zi)^k (partial
fractions; if there are no multiplicities, then k = 1). It remains to represent each of


the simple fractions as a sum of a power series and figure out what is the smallest
radius of convergence.
Example. Let us find a power series representation for a function
f(z) = 1/(1 − z + z²)

We have 1 − z + z² = (z − z1)(z − z2) where z1,2 = 1/2 ± i√3/2 = e^{±iπ/3}. With help
of partial fractions, we get
f(z) = (1/(z1 − z2)) (1/(z − z1) − 1/(z − z2))
By applying (2.2) twice, we get a representation
f(z) = (1/(z1 − z2)) [ (1/z2 − 1/z1) + (1/z2² − 1/z1²) z + (1/z2³ − 1/z1³) z² + . . . ]
and the radius of convergence of the series is equal to |z1,2 | = 1. However, z1 and z2
are complex numbers and we expect the coefficients of the series to be real numbers.
To simplify the formula, we note that z1 and z2 are complex conjugates. Moreover,
z1 z2 = |z1 |2 = 1, so z1 and z2 are reciprocal to each other. Therefore
1/z2^k − 1/z1^k = z1^k − z2^k.
Also, z1 − z2 = z1 − z̄1 = 2i Im z1 = 2i sin(π/3) and z1^k − z2^k = z1^k − z̄1^k = 2i Im z1^k =
2i sin(kπ/3). As a result, we get the formula
f(z) = 1 + (sin(2π/3)/sin(π/3)) z + (sin(3π/3)/sin(π/3)) z² + · · · + (sin((k + 1)π/3)/sin(π/3)) z^k + . . .
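These coefficients are easy to check numerically. Multiplying the series by the denominator shows that they satisfy c0 = c1 = 1 and ck = ck−1 − ck−2 for k ≥ 2 (a difference equation of the kind studied in the next section); the Python sketch below (numpy assumed) compares this recursion with the sine formula:

    import numpy as np

    K = 12
    c = np.empty(K)
    c[0], c[1] = 1.0, 1.0
    for k in range(2, K):
        c[k] = c[k - 1] - c[k - 2]     # from (1 - z + z^2)(c0 + c1 z + ...) = 1

    k = np.arange(K)
    formula = np.sin((k + 1) * np.pi / 3) / np.sin(np.pi / 3)
    print(np.allclose(c, formula))     # True: the two sets of coefficients agree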
Analytic functions. A complex-valued function f (z) of a complex argument
z is analytic in a domain if it is differentiable there. If a function is analytic, then
it is actually infinitely many times differentiable, and it can be locally represented
as a sum of a power series. All polynomials, rational functions and many other
functions are analytic. Out of all numerous facts about analytic functions, we need
this one.
Suppose f (z) and g(z) are two functions that are analytic in the same domain,
and suppose f (z) = g(z) if z belongs to certain smooth curve that lies inside the
domain (a segment of a straight line, an arc of a unit circle or anything else). Then
f and g coincide everywhere within the domain. This surprising property is easy
to explain. In fact, the values along the curve allow us to find the values of all
the derivatives at every point on the curve, and therefore to find the power series
representation which turns out to be the same for both functions.

Exercises
For each of the following functions, find its representation as a sum of a power
series. Find the corresponding radius of convergence.
1. f(z) = (1 + z)/(1 − 2z)
2. f(z) = 1/(1 − 2z)²
3. f(z) = (1 + z)/(2 − 3z + z²)
4. f(z) = 1/(1 + z + z²)
5. f(z) = 1/(1 − z + z² − z³)
6. f(z) = 1/(1 − 2√3 z + 4z²)

3. Difference Equations
Difference equations play an important role in time series analysis when we
study autocorrelation function of a stationary series.
We say that a sequence xn , n = 0, 1, 2, . . . satisfies a difference equation of the
order k, if, for all n ≥ k,
(3.1) xn + a1 xn−1 + · · · + ak xn−k = 0
where a1, . . . , ak are some (known) coefficients. We will assume that all the
coefficients are real numbers and that ak ≠ 0. The relation (3.1) also
goes under the name ‘homogeneous linear recurrence relation with constant coeffi-
cients’. The equation (3.1) defines the sequence recursively; we need to know the
first k values (so called ‘seed values’). Nonetheless, we’d like to have a closed-form
formula for a general solution.
We begin with the associated characteristic polynomial
(3.2) p(z) = z k + a1 z k−1 + · · · + ak
It has k zeroes, some of them could be complex numbers. However, if p(z) = 0,
then p(z̄) = 0 as well. Hence, we have a collection of real roots, possibly with mul-
tiplicities, and a collection of complex roots that come in pairs (z and its conjugate
z̄), also with multiplicities.
A general solution to (3.1) could be written in terms of the roots of (3.2). Let
us suppose first that all the roots z1 , . . . , zk are real and distinct. Then a general
solution to the equation (3.1) is given by the formula
(3.3)  xn = Σ_{j=1}^{k} Cj zj^n

Constants C1 , . . . , Ck could be found from the seed values x0 , . . . , xk−1 — we write


down the equation (3.3) for every n = 0, . . . , k − 1 and get k linear equations that
allow us to find the constants Ci .
As we can see, to each root z there corresponds a sequence z n that is a solution
to (3.1); those solutions are linearly independent and form a basis in the space of
all solutions.
If some of the roots are complex (but still, no multiplicities), the formula (3.3)
is still correct, though we should allow the coefficients Ci to be complex numbers.
In order to get a formula that produces real-valued sequences, we should represent
complex roots in the exponential form zj = rj (cos(θj ) + i sin(θj )) = rj eiθj . A
general solution is given by the formula
(3.4)  xn = Σ_{real roots} Cj zj^n + Σ_{conjugate pairs zj, zj+1} (Cj rj^n cos(nθj) + Cj+1 rj^n sin(nθj))
So, in this case, to each real root z, there corresponds a solution z n ; to each pair
of conjugate complex roots z = r(cos(θ) ± i sin(θ)), there correspond two solutions
that are equal to rn cos(nθ) and rn sin(nθ) or, which is the same, to the real and
the imaginary parts of the sequence z n .
Situation is more complicate if not all of the roots are distinct. Suppose the
equation (3.2) has d distinct real roots z1 , . . . , zd with multiplicities mj , so that
m1 + · · · + md = k. The general solution to the equation (3.1) is given by the
formula
(3.5)  xn = Σ_{j=1}^{d} Σ_{r=0}^{mj−1} C_{j,r} n^r zj^n

Hence, to each root z with multiplicity m, there correspond m linearly independent


solutions
z n , nz n , . . . , nm−1 z n .
If some of the roots are complex, we should mix together the formulas (3.4) and
(3.5); so that for every real root z with multiplicity m, we should include m solutions
z n , nz n , . . . , nm−1 z n ; for every pair of conjugate complex roots z = r(cos(θ) ±
i sin(θ)), each of them with multiplicity m, we should include 2m solutions
rn cos(nθ), nrn cos(nθ), . . . , nm−1 rn cos(nθ)
and
rn sin(nθ), nrn sin(nθ), . . . , nm−1 rn sin(nθ)
Examples. 1. First, consider the equation
(3.6) xn − 3xn−1 + 2xn−2 = 0, n≥2
with initial conditions x0 = 0, x1 = 1. The characteristic polynomial z 2 − 3z + 2
has roots z1 = 1 and z2 = 2. So, we have distinct real roots and general solution
to the equation (3.6) has the form
xn = C1 + C2 2n
In order to find the constants C1 and C2 , we set n = 0 and n = 1. For n = 0, we
get the equation C1 + C2 = 0. For n = 1, we get the equation C1 + 2C2 = 1. Hence
C2 = 1, C1 = −1 and
xn = 2n − 1.
2. Consider now the equation
(3.7) xn − 2xn−1 + 2xn−2 = 0, n ≥ 2,
with initial conditions x0 = x1 = 1. Its characteristic polynomial z² − 2z + 2 has
roots z1,2 = 1 ± i = √2 e^{±iπ/4}. So, we have a pair of complex roots. By (3.4), the
general solution to the equation (3.7) has the form
xn = (√2)^n (C1 cos(nπ/4) + C2 sin(nπ/4))
Setting n = 0, we get the equation C1 = 1. For n = 1, we get the equation
1 = √2(√2/2 + C2 √2/2) = 1 + C2,
which implies C2 = 0. Therefore
xn = (√2)^n cos(nπ/4)
3. Consider now the equation


(3.8) xn − 4xn−1 + 4xn−2 = 0, n ≥ 2,
with initial conditions x0 = 1, x1 = 0. This time, its characteristic polynomial
z 2 − 4z + 4 = (z − 2)2 has one root z1,2 = 2 with multiplicity 2. By (3.5), general
solution to the equation (3.8) has the form
xn = C1 2n + C2 n2n
Setting n = 0, we get an equation C1 = 1. For n = 1, we get an equation
0 = 2(1 + C2 )
which implies C2 = −1. Therefore
xn = (1 − n)2n
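Closed-form solutions like these are easy to check against the recursion itself. A Python sketch for Example 2 (numpy assumed; twelve terms is an arbitrary choice):

    import numpy as np

    N = 12
    x = np.empty(N)
    x[0], x[1] = 1.0, 1.0
    for n in range(2, N):
        x[n] = 2 * x[n - 1] - 2 * x[n - 2]   # the recursion x_n - 2x_{n-1} + 2x_{n-2} = 0

    n = np.arange(N)
    closed_form = np.sqrt(2.0) ** n * np.cos(n * np.pi / 4)
    print(np.allclose(x, closed_form))       # True: the closed form reproduces the sequence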

Exercises
1. (Fibonacci numbers) Let x1 = x2 = 1 and let xn = xn−1 + xn−2 , n ≥ 3.
Find a closed-form representation for xn .
2. Let x1 = x2 = 1 and let
xn = xn−1 + 0.25xn−2 , n ≥ 3.
Find a closed-form representation for xn .
3. Let again x1 = x2 = 1 and let

xn = 2√3 xn−1 + 4xn−2 ,   n ≥ 3.
Find a closed-form representation for xn .
4. Let now x1 = 0, x2 = x3 = 1 and let
xn = xn−1 − xn−2 + xn−3 , n ≥ 4.
(a) Find a closed-form representation for xn . (b) Do the same with x1 = x2 = x3 =
1.
5. Let again x1 = x2 = x3 = 1 and let
xn = 6xn−1 − 12xn−2 + 8xn−3 , n ≥ 4.
Find a closed-form representation for xn .
6*. Let x1 = x2 = 1, x3 = x4 = 0 and let
xn = 4√3 xn−1 − 20xn−2 + 16√3 xn−3 − 16xn−4 ,   n ≥ 5.
Find a closed-form representation for xn .
Index

F distribution, 241 band pass tangent filter, 193


χ2 distribution, 240 band reject ideal filter, 180
σ-additivity, 231 band reject tangent filter, 195
t distribution, 241 Bartlett window, 160
53X smoothing, 31 Bartlett-Priestley window, 161
Bayes formula, 232
Akaike Information criterion, AIC, 123 Box-Cox transformation, 13
analytic function, 266 Butterworth sine filter, 184
ARIMA process, 89 Butterworth tangent filter, 184
ARIMA process, prediction, 96
ARIMA, seasonal, 208 Cauchy distribution, 242
ARMA process, 85 central limit theorem, 244
ARMA process, invertibility condition, 85 characteristic function, 238
ARMA process, prediction, 95 Chebyshev inequality, 234
ARMA process, spectral density, 141 co-spectrum, 222
ARMA process, stationarity condition, 85 coherency function, 222
ARMA process, Yule—Walker equations, complex exponent, 261
85 complex number, 261
asymptotic stationarity, 62, 69, 79 complex number, absolute value of, 261
autocorrelation function (ACF), 53 complex number, conjugate, 261
autocovariance function (ACV), 52 complex number, exponential form, 262
autoregression of the first order, 62 complex number, imaginary part of, 261
autoregression of the first order, ACF and complex number, real part of, 261
PACF, 63 conditional density, 235
autoregression of the first order, prediction, conditional expectation, 236
93 conditional variance, 236
autoregression of the first order, spectrum confidence interval, 253
density, 140 confidence interval, level of significance, 253
autoregression of the first order, continuous random variable, 232
stationarity condition, 62 continuous random variable, density of, 232
autoregression of the order k, 78 convergence in mean squares, 245
autoregression of the order k, prediction, 94 convergence in probability, 244, 250
autoregression of the order k, stationarity convolution, 159, 237
condition, 78 correlation coefficient, 234
autoregression of the order k, Yule—Walker countable additivity, 231
equations, 79 covariance, 234
autoregression of the second order, 67 covariance matrix, 235
autoregression of the second order, Cramér-Rao inequality, 251
stationarity condition, 67 critical region, 254
autoregression of the second order, critical statistics, 254
Yule—Walker equation, 70 cross-amplitude spectrum, 222
cross-correlation function (CCF), 219
back shift operator, 65 cross-correlation function, estimation, 220
band pass ideal filter, 180 cross-covariance function (CCV), 219

cross-periodogram, 223 Fisher distribution, 241


cross-spectrum, 221 Fisher information, 251
cross-spectrum, estimation, 223
cross-spectrum, gain function, 222 gain function of the cross-spectrum, 222
cross-spectrum, phase of, 222 Gamma distribution, 240
cut-off point, 180 Gamma function, 240
Gaussian random variable, 239
Daniell window, 160 general linear process, 60
dependent variable, 257 general linear process, ACV, 60
difference equation, characteristic geometric series, 265
polynomial, 267 Gompertz curve, 22
difference equations, 267
difference operator, 66 high pass ideal filter, 180
differencing, 172 high pass tangent filter, 191
Dirichlet kernel, 160 high pass tangent filter, first order, 192
discrete Fourier transform, 155 high pass tangent filter, second order, 192
discrete Fourier transform, inverse, 155 Holt-Winters model, 207
discrete random variable, 232
independent random variables, 235
discrete random variable, probability mass
independent variable(s), 257
function of, 232
integrated periodogram, 175
efficiency of the window, 171 integrated spectrum, 174
estimate, asymptotically efficient, 251
Jacobian, 238
estimate, asymptotically normal, 252
joint density, 235
estimate, asymptotically unbiased, 250
joint distribution, 235
estimate, bias of, 250
estimate, consistent, 250 Kalman filter, 224
estimate, efficiency of, 251 Kalman filter, gain matrix, 229
estimate, maximum likelihood, 252 Kalman filter, missing data model, 227
estimate, minimal variance unbiased, 251 Kendall rank correlation, 45
estimate, unbiased, 250
event, 231 lag window, 158
event, elementary, 231 lag window generator, 158
event, impossible, 231 lag window, truncation point, 158
event, probability of, 231 lag window, window width, 158
events, disjoint, 231 Laplace distribution, 252
events, independent, 232 Laplace transform, 239
events, mutually exclusive, 231 law of large numbers, strong, 244
expectation, 233 law of large numbers, weak, 244
expectation, conditional, 236 law of total probability, 232
exponential smoothing, 33 leakage, 172
exponential smoothing, first order, 35 least squares, 17, 257
exponential smoothing, second order, 35 likelihood function, 252
exponential smoothing, zero order, 34 linear regression, 257
log likelihood function, 252
Fast Fourier transform, 176 logistic curve, 22
Fejer kernel, 169 low pass ideal filter, 180
filter, 179 low pass tangent filter, 187
filter, band pass, 180 low pass tangent filter, first order, 188
filter, band reject, 180 low pass tangent filter, second order, 188
filter, gain function, 198
filter, high pass, 180 marginal density, 235
filter, ideal, 180 Markov inequality, 233
filter, low pass, 180 maximum likelihood, 252
filter, phase, 198 median, 248
filter, recursive, 179 moment generating function, 238
filter, tangent, 184 moving average process of the first order, 57
filter, transfer function of, 179 moving average process of the first order,
finite impulse response (FIR) filter, 200 ACF and PACF, 57
moving average process of the first order, rational function, 265


invertibility condition, 57 rectangular window, 160
moving average process of the first order, roots of unity, 262
prediction, 94
moving average process of the first order, sample ACF, 101
spectral density, 140 sample ACV, 101
moving average process of the order l, 57 sample mean, 249
moving average process of the order l, sample PACF, 101
ACF, 58 sample space, 231
moving average process of the order l, sample variance, 250
invertibility condition, 82 Schwartz inequality, 234
moving average process of the order l, seasonal ARIMA, 208
prediction, 95 seasonal Box - Jenkins, 208
moving averages, 27 seasonal difference, 66
moving averages, 1st order, 27 Spearman rank correlation, 45
moving averages, 2nd order, 28 spectral bandwidth, 166
moving averages, weighted, 27 spectral density, 138
multiple linear regression, 258 spectral density, confidence intervals, 162
multivariate Gaussian distribution, 242 spectral density, normalized, 138
multivariate normal distribution, 242 spectral window, 159
spectral window generator, 159
normal distribution, 239 spectral window, bandwidth, 167
standard deviation, 233
outcome, 231 state-space model, 224
state-space model, measurement equation,
partial autocorrelation function (PACF), 54 226
Parzen window, 161 state-space model, transitional equation,
periodogram, 149 225
periodogram, truncated, 160 stationary process, 52
Portmanteau lack-of-fit test, 123 stationary process, completely, 52
positive semi-definiteness, 54 stationary process, estimation of the mean,
Potter 310 filter, 201 99
power series, 264 stationary process, estimation of the
power series, radius of convergence, 264 variance, 100
pre-whitening, 173 stationary process, second order, 52
principal frequencies, 150 stationary process, strictly, 52
probability, 231 stationary process, weakly, 52
probability, axioms, 231 stationary series, 10
probability, conditional, 231 statistical hypothesis, 254
statistical hypothesis, composite, 254
quadrature spectrum, 222 statistical hypothesis, simple, 254
statistical test, power of, 254
random variable, 232 stochastic process, 51
random variable, χ2 , 240 stochastic process, Gaussian, 52
random variable, (cumulative) distribution Student distribution, 241
function of, 232
random variable, c.d.f, 232 tangent filter, band pass, 193
random variable, continuous, 232 tangent filter, band reject, 195
random variable, discrete, 232 tangent filter, high pass, 191
random variable, expectation of, 233 tangent filter, low pass, 187
random variable, exponential, 240 Theil-Wage model, 208
random variable, Gaussian, 239 transfer function, 179
random variable, Laplace, 252 trend model, 17
random variable, moments of, 238 trend model, linear, 17
random variable, normal, 239 triangular window, 160
random variable, standard deviation of, 233 Tukey window, 161
random variable, standard normal, 239 Tukey-Hanning window, 161
random variable, variance of, 233 type 1 error, 254
random variables, independent, 235 type 2 error, 254
uniform distribution, 242

variance, 233
variance, conditional, 236

white noise, 17, 51


white noise, Gaussian, 51
Wick formula, 243
Wold’s representation theorem, 138
