University of Paris 7 - Lyxor Asset Management
Master thesis
Momentum Strategies:
From novel Estimation Techniques to
Financial Applications
Author:
Tung-Lam Dao
Supervisor:
Prof. Thierry Roncalli
September 30, 2011
Contents

Acknowledgments
Confidential notice
Introduction

1 Trading Strategies with $L_1$ Filtering
  1.1 Introduction
  1.2 Motivations
  1.3 $L_1$ filtering schemes
    1.3.1 Application to trend-stationary process
    1.3.2 Extension to mean-reverting process
    1.3.3 Mixing trend and mean-reverting properties
    1.3.4 How to calibrate the regularization parameters?
  1.4 Application to momentum strategies
    1.4.1 Estimating the optimal filter for a given trading date
    1.4.2 Backtest of a momentum strategy
  1.5 Extension to the multivariate case
  1.6 Conclusion

2 Volatility Estimation for Trading Strategies
  2.1 Introduction
  2.2 Range-based estimators of volatility
    2.2.1 Range-based daily data
    2.2.2 Basic estimator
    2.2.3 High-low estimators
    2.2.4 How to eliminate both drift and opening effects?
    2.2.5 Numerical simulations
    2.2.6 Backtest
  2.3 Estimation of realized volatility
    2.3.1 Moving-average estimator
    2.3.2 IGARCH estimator
    2.3.3 Extension to range-based estimators
    2.3.4 Calibration procedure of the estimators of realized volatility
  2.4 High-frequency volatility estimators
    2.4.1 Microstructure effect
    2.4.2 Two time-scale volatility estimator
    2.4.3 Numerical implementation and backtesting
  2.5 Conclusion

3 Support Vector Machine in Finance
  3.1 Introduction
  3.2 Support vector machine at a glance
    3.2.1 Basic ideas of SVM
    3.2.2 ERM and VRM frameworks
  3.3 Numerical implementations
    3.3.1 Dual approach
    3.3.2 Primal approach
    3.3.3 Model selection - Cross validation procedure
  3.4 Extension to SVM multi-classification
    3.4.1 Basic idea of multi-classification
    3.4.2 Implementations of multiclass SVM
  3.5 SVM-regression in finance
    3.5.1 Numerical tests on SVM-regressors
    3.5.2 SVM-filtering for forecasting the trend of signal
    3.5.3 SVM for multivariate regression
  3.6 SVM-classification in finance
    3.6.1 Test of SVM-classifiers
    3.6.2 SVM for classification
    3.6.3 SVM for score construction and stock selection
  3.7 Conclusion

4 Analysis of Trading Impact in the CTA strategy
  4.1 Introduction
  4.2 Conclusion

Conclusions

A Appendix of chapter 1
  A.1 Computational aspects of $L_1$, $L_2$ filters
    A.1.1 The dual problem
    A.1.2 The interior-point algorithm
    A.1.3 The scaling of smoothing parameter of $L_1$ filter
    A.1.4 Calibration of the $L_2$ filter
    A.1.5 Implementation issues

B Appendix of chapter 2
  B.1 Estimator of volatility
    B.1.1 Estimation with realized return

C Appendix of chapter 3
  C.1 Dual problem of SVM
    C.1.1 Hard-margin SVM classifier
    C.1.2 Soft-margin SVM classifier
    C.1.3 ε-SV regression
  C.2 Newton optimization for the primal problem
    C.2.1 Quadratic loss function
    C.2.2 Soft-margin SVM

Published paper
List of Figures

1.1  $L_1$-T filtering versus HP filtering for the model (1.2)
1.2  $L_1$-T filtering versus HP filtering for the model (1.3)
1.3  $L_1$-C filtering versus HP filtering for the model (1.5)
1.4  $L_1$-C filtering versus HP filtering for the model (1.6)
1.5  $L_1$-TC filtering versus HP filtering for the model (1.2)
1.6  $L_1$-TC filtering versus HP filtering for the model (1.3)
1.7  Influence of the smoothing parameter $\lambda$
1.8  Scaling power law of the smoothing parameter $\lambda_{\max}$
1.9  Cross-validation procedure for determining the optimal value $\lambda^*$
1.10 Calibration procedure with the S&P 500 index
1.11 Cross-validation procedure for the two-trend model
1.12 Comparison between different $L_1$ filters on the S&P 500 index

2.1  Data set of 1 trading day
2.2  Volatility estimators without drift and opening effects (M = 50)
2.3  Volatility estimators without drift and opening effect (M = 500)
2.4  Volatility estimators with drift $\mu$ = 30% and without opening effect (M = 500)
2.5  Volatility estimators with opening effect f = 0.3 and without drift (M = 500)
2.6  Volatility estimators with correction of the opening jump (f = 0.3)
2.7  Volatility estimators on stochastic volatility simulation
2.8  Test of voltarget strategy with stochastic volatility simulation
2.9  Test of voltarget strategy with stochastic volatility simulation
2.10 Comparison between different probability density functions
2.11 Comparison between the different cumulative distribution functions
2.12 Volatility estimators on S&P 500 index
2.13 Volatility estimators on BHI UN Equity
2.14 Estimation of the closing interval for S&P 500 index
2.15 Estimation of the closing interval for BHI UN Equity
2.16 Likelihood function for various estimators on S&P 500
2.17 Likelihood function for various estimators on BHI UN Equity
2.18 Backtest of voltarget strategy on S&P 500 index
2.19 Backtest of voltarget strategy on BHI UN Equity
2.20 Comparison between IGARCH estimator and CC estimator
2.21 Likelihood function of high-low estimators versus filtered parameter
2.22 Likelihood function of high-low estimators versus effective moving window
2.23 IGARCH estimator versus moving-average estimator for close-to-close prices
2.24 Comparison between different IGARCH estimators for high-low prices
2.25 Daily estimation of the likelihood function for various close-to-close estimators
2.26 Daily estimation of the likelihood function for various high-low estimators
2.27 Backtest for close-to-close estimator and realized estimators
2.28 Backtest for IGARCH high-low estimators compared to the IGARCH close-to-close estimator
2.29 Two-time scale estimator of intraday volatility

3.1  Geometric interpretation of the margin in a linear SVM
3.2  Binary decision tree strategy for the multiclassification problem
3.3  $L_1$-regressor versus $L_2$-regressor with Gaussian kernel for model (3.16)
3.4  $L_1$-regressor versus $L_2$-regressor with Gaussian kernel for model (3.17)
3.5  Comparison of different regression kernels for model (3.16)
3.6  Comparison of different regression kernels for model (3.17)
3.7  Cross-validation procedure for determining the optimal value $C^*$
3.8  SVM-filtering with fixed horizon scheme
3.9  SVM-filtering with dynamic horizon scheme
3.10 $L_1$-regressor versus $L_2$-regressor with Gaussian kernel for model (3.16)
3.11 Comparison of different kernels for multivariate regression
3.12 Comparison between Dual algorithm and Primal algorithm
3.13 Illustration of non-linear classification with Gaussian kernel
3.14 Illustration of multiclassification with SVM-BDT for in-sample data
3.15 Illustration of multiclassification with SVM-BDT for out-of-sample data
3.16 Illustration of multiclassification with SVM-BDT for $\sigma$ = 0
3.17 Illustration of multiclassification with SVM-BDT for $\sigma$ = 0.2
3.18 Multiclassification with SVM-BDT on training set
3.19 Prediction efficiency with SVM-BDT on the validation set
3.20 Comparison between simulated score and Probit score for d = 2
3.21 Comparison between simulated score CDF and Probit score CDF for d = 2
3.22 Comparison between simulated score PDF and Probit score PDF for d = 2
3.23 Selection curve for long strategy for simulated data and Probit model
3.24 Probit scores for Eurostoxx data with d = 20 factors
3.25 SVM scores for Eurostoxx data with d = 20 factors

A.1  Spectral density of moving-average and $L_2$ filters
A.2  Relationship between the value of $\lambda$ and the length of the moving-average filter
List of Tables

1.1 Results for the Backtest
2.1 Estimation error for various estimators
2.2 Performance of $\hat\sigma^2_{HL}$ versus $\hat\sigma^2_{CC}$ for different averaging windows
2.3 Performance of $\hat\sigma^2_{HL}$ versus $\hat\sigma^2_{CC}$ for different filters of f
Acknowledgments

During the six unforgettable months spent in the R&D team of Lyxor Asset Management, I have experienced and enjoyed every moment. Apart from all the professional experience that I gained from everyone in the department, I really appreciated the great atmosphere in the team, which motivated me every day.

I would like first to thank Thierry Roncalli for his supervision during my stay in the team. I could never have imagined learning so many interesting things during my internship without his direction and his confidence. Thierry introduced me to the financial concepts of the asset management world in a very interactive way; I would say that I have learnt finance in every single discussion with him. He taught me how to combine learning and practice. On the professional side, Thierry helped me to fill the gaps in my financial knowledge by allowing me to work on various interesting topics, and he gave me the confidence to present my understanding of this field. In daily life, Thierry shared his own experience and also taught me how to adapt to this new world.

I would like to thank Nicolas Gaussel for his warm welcome to the Quantitative Management department, for his confidence and for his encouragement during my stay at Lyxor. I had the chance to work with him on a very interesting topic concerning the CTA strategy, which plays an important role in asset management. I would like to thank Benjamin Bruder, my nearest neighbor, for his guidance and his supervision throughout my internship. Informally, Benjamin was almost my co-advisor. I must say that I owe him a lot for all of his patience in our daily discussions, teaching me and working out the many questions arising from my projects. I am really grateful for his sense of humor, which warmed up the atmosphere.

To all members of the R&D team, I would like to express my gratitude for their help, their advice and everything that they shared with me during my stay. I am really happy to be one of them. Thanks to Jean-Charles for his friendship, for all the daily discussions and for his support of all the initiatives in my projects. A great thank you to Stephane, who always cheered up the breaks with his intelligent humor; I would say that I have learnt from him the most interesting view of the "Binomial world". Thanks to Karl for his explanations of his macro-world. Thanks to Pierre for all his help with data collection and his passion in every explanation, such as the story of Merrill Lynch's investment clock. Thanks to Zelia for a very stimulating collaboration on my last project and for the great time during our internship.

To everyone on the other side of the room: I would like to thank Philippe Balthazard for his comments on my projects and his point of view on financial aspects. Thanks to Hoang-Phong Nguyen for his help with the database and his support during my stay. There are many other people with whom I had the chance to interact but whom I cannot cite here.

Thanks to my parents and my sister, who always believe in me and supported me during my change to a new direction. Finally, I would like to reserve the greatest thanks for my wife and my son, for their love and daily encouragement. They were always behind me during the most difficult moments of this year.
Confidential notice

This thesis is based on confidential research carried out in the R&D team of Lyxor Asset Management. It is divided into two main parts. The first part, comprising the first three chapters (1, 2 and 3), consists of applications of novel estimation techniques for the trend and the volatility of financial time series. We present the main results in detail, together with a publication in the Lyxor White Paper series. The second part, concerning the analysis of the CTA performance in the risk-return framework (see The Lyxor White Paper Series, Issue #7, June 2011), is skipped due to confidentiality. Only a brief introduction and the final conclusion of this part (Chapter 4) are presented in order to sketch out its main features.

This document contains information confidential and proprietary to Lyxor Asset Management. The information may not be used, disclosed or reproduced without the prior written authorization of Lyxor Asset Management, and those so authorized may only use the information for the purpose of evaluation consistent with the authorization.
Introduction

During the internship in the Research and Development team of Lyxor Asset Management, we studied novel techniques applicable to asset management. We focused on the analysis of some special classes of momentum strategies, such as trend-following strategies and volatility-target ("voltarget") strategies. These strategies play a crucial role in quantitative management, as they aim to optimize returns by exploiting signals of market inefficiency and to limit market risk via an efficient control of the volatility.

The objectives of this report are two-fold. We first studied some novel techniques from the fields of statistics and signal processing, such as trend filtering, daily and high-frequency volatility estimators, and support vector machines. We employed these techniques to extract useful financial signals. These signals are used to implement the momentum strategies described in detail in every chapter of this report. The second objective concerns the study of the performance of these strategies based on the general risk-return analysis framework (see B. Bruder and N. Gaussel, Lyxor White Paper, Issue #7). This report is organized as follows:
In the first chapter, we discuss various implementations of $L_1$ filtering in order to detect some properties of noisy signals. This filter uses an $L_1$ penalty condition in order to obtain a filtered signal composed of a set of straight trends or steps. This penalty condition, which determines the number of breaks, is implemented in a constrained least squares problem and is represented by a regularization parameter, which is estimated by a cross-validation procedure. Financial time series are usually characterized by a long-term trend (called the global trend) and some short-term trends (called local trends). A combination of these two time scales yields a simple model describing a global trend process with some mean-reverting properties. Explicit applications to momentum strategies are also discussed in detail with appropriate uses of the trend configurations.
In the second chapter, we review various techniques for estimating the volatility. We start by discussing estimators based on the range of daily monitoring data, then we consider the stochastic volatility model in order to determine the instantaneous volatility. At high trading frequencies, stock prices fluctuate with an additional noise, the so-called microstructure noise. This effect comes from the bid-ask spread and the short time scale: within a short time interval, the trading price does not exactly reflect the equilibrium price determined by supply and demand, but bounces between the bid and ask prices. In the second part of the chapter, we discuss the effect of the microstructure noise on the volatility estimation, a very important topic for the large field of high-frequency trading. Examples of backtests on an index and on single stocks illustrate the efficiency of the considered techniques.
The third chapter is dedicated to the study of a general machine-learning framework. We review the well-known machine-learning technique called the support vector machine (SVM). This technique can be employed in different contexts such as classification, regression or density estimation, following Vapnik [1998]. Within the scope of this report, we first give an overview of this method and its numerical implementations, then bridge it to financial applications such as trend forecasting, stock selection, sector recognition and score construction.
We finish in Chapter 4 with the performance analysis of the CTA strategy. We first review trend-following strategies within the Kalman filter framework and study the impact of the trend estimation error. We start the discussion with the case of a momentum strategy on a single asset, then generalize the analysis to the multi-asset case. In order to construct the allocation strategy, we employ the observed trend, which is filtered by an exponential moving average. It can be demonstrated that the cumulated return of the strategy can be split into two important parts. The first one is called the "Option Profile", which involves only the current measured trend. This idea is very similar in concept to the straddle profile suggested by Fung and Hsieh (2001). The second part is called the "Trading Impact", which involves an integral of the measured trend over the trading period. We focus on this second quantity by estimating its probability distribution function and the associated gain and loss expectations. We illustrate via a toy model how the number of assets and their correlations influence the performance of a strategy. This study reveals important results which can be directly tested on CTA funds.
Chapter 1

Trading Strategies with $L_1$ Filtering
In this chapter, we discuss various implementations of $L_1$ filtering in order to detect some properties of noisy signals. This filter uses an $L_1$ penalty condition in order to obtain a filtered signal composed of a set of straight trends or steps. This penalty condition, which determines the number of breaks, is implemented in a constrained least squares problem and is represented by a regularization parameter, which is estimated by a cross-validation procedure. Financial time series are usually characterized by a long-term trend (called the global trend) and some short-term trends (called local trends). A combination of these two time scales yields a simple model describing a global trend process with some mean-reverting properties. Explicit applications to momentum strategies are also discussed in detail with appropriate uses of the trend configurations.

Keywords: Momentum strategy, $L_1$ filtering, $L_2$ filtering, trend-following, mean-reverting.
1.1 Introduction

Trend detection is a major task of time series analysis, from both the mathematical and the financial points of view. The trend of a time series is considered as the component containing the global change, in contrast to the local change due to the noise. The procedure of trend filtering concerns not only the problem of denoising; it must also take into account the dynamics of the underlying process. This explains why mathematical approaches to trend extraction have a long history and why this subject still attracts great interest in the scientific community¹. From an investment perspective, trend filtering is the core of most momentum strategies developed in the asset management industry and the hedge fund community in order to improve performance and limit the risk of portfolios.

¹ For a general review, see Alexandrov et al. (2008).
This chapter is organized as follows. In Section 1.2, we discuss the trend-cycle decomposition of time series and review general properties of $L_1$ and $L_2$ filtering. In Section 1.3, we describe the $L_1$ filter with its various extensions and the calibration procedure. In Section 1.4, we apply $L_1$ filters to some momentum strategies and present the results of some backtests on the S&P 500 index. In Section 1.5, we discuss a possible extension to the multivariate case, and we conclude in the last section.
1.2 Motivations

In economics, the trend-cycle decomposition plays an important role in describing a non-stationary time series in terms of permanent and transitory stochastic components. Generally, the permanent component is assimilated to a trend, whereas the transitory component may be a noise or a stochastic cycle. Moreover, the literature on business cycles has produced a large number of empirical studies on this topic (see for example Cleveland and Tiao (1976), Beveridge and Nelson (1981), Harvey (1991) or Hodrick and Prescott (1997)). These last authors introduced a new method to estimate the trend of long-run GDP. Their method, widely used by economists, is based on $L_2$ filtering. Recently, Kim et al. (2009) developed a similar filter by replacing the $L_2$ penalty function by an $L_1$ penalty function.

Let us consider a time series $y_t$ which can be decomposed into a slowly varying trend $x_t$ and a rapidly varying noise process $\varepsilon_t$:

$$y_t = x_t + \varepsilon_t$$
Let us first recall the well-known $L_2$ filter (the so-called Hodrick-Prescott filter). This scheme determines the trend $x_t$ by minimizing the following objective function:

$$\frac{1}{2}\sum_{t=1}^{n}\left(y_t - x_t\right)^2 + \lambda \sum_{t=2}^{n-1}\left(x_{t-1} - 2x_t + x_{t+1}\right)^2$$

with $\lambda > 0$ the regularization parameter, which controls the competition between the smoothness of $x_t$ and the residual $y_t - x_t$ (or the noise $\varepsilon_t$). We remark that the second term is the discrete second derivative of the trend $x_t$, which characterizes the smoothness of the curve. Minimizing this objective function gives a solution which is a trade-off between fidelity to the data and smoothness of its curvature. In finance, this scheme does not give a clear signature of the market tendency. By contrast, if we replace the $L_2$ norm by the $L_1$ norm in the objective function, we obtain more interesting properties. Therefore, Kim et al. (2009) propose to consider the following objective function:
$$\frac{1}{2}\sum_{t=1}^{n}\left(y_t - x_t\right)^2 + \lambda \sum_{t=2}^{n-1}\left|x_{t-1} - 2x_t + x_{t+1}\right|$$
This problem is closely related to the Lasso regression of Tibshirani (1996) and to the $L_1$-regularized least squares problem of Daubechies et al. (2004). Here, taking the $L_1$ norm imposes that the second derivative of the filtered signal is zero except at a small number of points. Hence, the filtered signal is composed of a set of straight trends and breaks². The competition between the two terms in the objective function becomes a competition between the number of straight trends (or number of breaks) and the closeness to the raw data. Therefore, the smoothing parameter $\lambda$ plays an important role in detecting the number of breaks. In the following, we present briefly how the $L_1$ filter works for trend detection, together with its extension to mean-reverting processes. The calibration procedure for the parameter $\lambda$ is also discussed in detail.
1.3 $L_1$ filtering schemes
1.3.1 Application to trend-stationary process

The Hodrick-Prescott scheme discussed in the last section can be rewritten in the vector space $\mathbb{R}^n$ with its $L_2$ norm $\|\cdot\|_2$ as:

$$\frac{1}{2}\left\|y - x\right\|_2^2 + \lambda\left\|Dx\right\|_2^2$$

where $y = (y_1, \dots, y_n)$, $x = (x_1, \dots, x_n) \in \mathbb{R}^n$ and the operator $D$ is the $(n-2)\times n$ matrix:

$$D = \begin{pmatrix} 1 & -2 & 1 & & & \\ & 1 & -2 & 1 & & \\ & & \ddots & \ddots & \ddots & \\ & & & 1 & -2 & 1 \end{pmatrix} \qquad (1.1)$$

The exact solution of this estimation problem is given by:

$$x^\star = \left(I + 2\lambda D^\top D\right)^{-1} y$$

The explicit expression of $x^\star$ allows a very simple numerical implementation with sparse matrices. As the $L_2$ filter is a linear filter, the regularization parameter $\lambda$ is calibrated by comparison with the usual moving-average filter. The details of the calibration procedure are given in Appendix A.1.4.
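As an illustration of this remark, the following sketch (not part of the original study; the function names are ours) builds the sparse second-difference operator $D$ of equation (1.1) and solves the linear system $x^\star = (I + 2\lambda D^\top D)^{-1} y$ with standard sparse routines.

```python
# Minimal sketch of the L2 / Hodrick-Prescott filter in closed form,
# x* = (I + 2*lambda*D'D)^{-1} y, using a sparse second-difference operator D.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def second_difference(n):
    """(n-2) x n second-order difference operator D of equation (1.1)."""
    e = np.ones(n)
    return sp.spdiags([e, -2 * e, e], [0, 1, 2], n - 2, n)

def hp_filter(y, lam):
    """Hodrick-Prescott (L2) trend estimate of a one-dimensional array y."""
    n = len(y)
    D = second_difference(n)
    A = sp.eye(n) + 2.0 * lam * (D.T @ D)   # I + 2*lambda*D'D (sparse)
    return spla.spsolve(A.tocsc(), y)
```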
The idea of the $L_2$ filter can be generalized to a larger class, the so-called $L_p$ filters, by using an $L_p$ penalty condition instead of the $L_2$ penalty. This generalization is already discussed in the work of Daubechies et al. (2004) for the linear inverse problem and in the Lasso regression problem of Tibshirani (1996). If we consider an $L_1$ filter, the objective function becomes:

$$\frac{1}{2}\sum_{t=1}^{n}\left(y_t - x_t\right)^2 + \lambda \sum_{t=2}^{n-1}\left|x_{t-1} - 2x_t + x_{t+1}\right|$$
² A break is a position where the trend of the signal changes.
which is equivalent to the following vector form:

$$\frac{1}{2}\left\|y - x\right\|_2^2 + \lambda\left\|Dx\right\|_1$$

Kim et al. (2009) have demonstrated that the dual problem of this $L_1$ filtering scheme is a quadratic program with some boundary constraints. The details of this derivation are given in Appendix A.1.1. In order to optimize the numerical computation speed, we follow Kim et al. (2009) in using a primal-dual interior-point method (see Appendix A.1.2). In the following, we check the efficiency of this technique on various trend-stationary processes.
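For readers who want to reproduce the experiments without coding the interior-point algorithm of Appendix A.1.2, the $L_1$-T primal problem can also be handed to a generic convex solver. The sketch below uses CVXPY purely as an illustration; it is not the implementation used in this thesis.

```python
# Illustration only: the L1-T primal problem solved with a generic convex solver.
# The thesis instead solves the dual QP with a primal-dual interior-point method.
import cvxpy as cp

def l1_trend_filter(y, lam):
    """Minimize 0.5*||y - x||_2^2 + lam*||D x||_1, D being the second difference."""
    x = cp.Variable(len(y))
    objective = 0.5 * cp.sum_squares(y - x) + lam * cp.norm1(cp.diff(x, 2))
    cp.Problem(cp.Minimize(objective)).solve()
    return x.value
```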
The first model consists of data simulated by a set of straight trend lines with a white noise perturbation:

$$\begin{cases} y_t = x_t + \varepsilon_t, & \varepsilon_t \sim \mathcal{N}\left(0, \sigma^2\right) \\ x_t = x_{t-1} + v_t \\ \Pr\left(v_t = v_{t-1}\right) = p \\ \Pr\left(v_t = b\left(U_{[0,1]} - \tfrac{1}{2}\right)\right) = 1 - p \end{cases} \qquad (1.2)$$
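A possible way to simulate model (1.2) is sketched below; the parameter names follow the text ($p$, $b$, $\sigma$), but the exact random-number conventions used for the figures are not specified in the thesis, so this generator should be read as an assumption.

```python
# Assumed simulation of model (1.2): a piecewise-linear trend plus Gaussian noise,
# with slope breaks occurring with probability 1 - p.
import numpy as np

def simulate_model_12(n=2000, p=0.99, b=0.5, sigma=15.0, seed=0):
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    v = b * (rng.uniform() - 0.5)            # initial slope
    for t in range(1, n):
        if rng.uniform() > p:                # slope break with probability 1 - p
            v = b * (rng.uniform() - 0.5)
        x[t] = x[t - 1] + v
    y = x + sigma * rng.standard_normal(n)   # observed noisy signal
    return x, y
```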
We present in Figure 1.1 the comparison between the $L_1$-T and HP filtering schemes³. The top-left graph is the real trend $x_t$, whereas the top-right graph presents the noisy signal $y_t$. The bottom graphs show the results of the $L_1$-T and HP filters. Here, we have chosen $\lambda = 5\,258$ for the $L_1$-T filtering and $\lambda = 1\,217\,464$ for the HP filtering. This choice of $\lambda$ for the $L_1$-T filtering is based on the number of breaks in the trend, which is fixed to 10 in this example⁴. The second model is a random walk generated by the following process:

$$\begin{cases} y_t = y_{t-1} + v_t + \varepsilon_t, & \varepsilon_t \sim \mathcal{N}\left(0, \sigma^2\right) \\ \Pr\left(v_t = v_{t-1}\right) = p \\ \Pr\left(v_t = b\left(U_{[0,1]} - \tfrac{1}{2}\right)\right) = 1 - p \end{cases} \qquad (1.3)$$
We present in Figure 1.2 the comparison between $L_1$-T filtering and HP filtering on this second model⁵.
1.3.2 Extension to mean-reverting process

As shown in the last paragraph, the use of the $L_1$ penalty on the second derivative gives a correct description of the signal tendency. Hence, a similar idea can be applied to other orders of the derivative. We present here the extension of this $L_1$ filtering technique to the case of mean-reverting processes. If we now impose the $L_1$ penalty
³ We consider $n = 2000$ observations. The parameters of the simulation are $p = 0.99$, $b = 0.5$ and $\sigma = 15$.
⁴ We discuss how to obtain $\lambda$ in the next section.
⁵ The parameters of the simulation are $p = 0.993$, $b = 5$ and $\sigma = 15$.
Figure 1.1: $L_1$-T filtering versus HP filtering for the model (1.2) (panels: Signal, Noisy signal, $L_1$-T filter, HP filter)
Figure 1.2: $L_1$-T filtering versus HP filtering for the model (1.3) (panels: Signal, Noisy signal, $L_1$-T filter, HP filter)
condition on the first derivative, we can expect to get a fitted signal with zero slope almost everywhere. The cost of this penalty is proportional to the number of jumps. In this case, we would like to minimize the following objective function:

$$\frac{1}{2}\sum_{t=1}^{n}\left(y_t - x_t\right)^2 + \lambda \sum_{t=2}^{n}\left|x_t - x_{t-1}\right|$$

or, in vector form:

$$\frac{1}{2}\left\|y - x\right\|_2^2 + \lambda\left\|Dx\right\|_1$$

Here the operator $D$ is the $(n-1)\times n$ matrix which is the discrete version of the first-order derivative:

$$D = \begin{pmatrix} -1 & 1 & & & \\ & -1 & 1 & & \\ & & \ddots & \ddots & \\ & & & -1 & 1 \end{pmatrix} \qquad (1.4)$$
We may apply the same minimization algorithm as previously (see Appendix A.1.1). To illustrate this, we consider a model with step trend lines perturbed by a white noise process:

$$\begin{cases} y_t = x_t + \varepsilon_t, & \varepsilon_t \sim \mathcal{N}\left(0, \sigma^2\right) \\ \Pr\left(x_t = x_{t-1}\right) = p \\ \Pr\left(x_t = b\left(U_{[0,1]} - \tfrac{1}{2}\right)\right) = 1 - p \end{cases} \qquad (1.5)$$
We employ this model to test the $L_1$-C filtering against an HP filtering adapted to the first derivative⁶, which corresponds to the following optimization program:

$$\min \; \frac{1}{2}\sum_{t=1}^{n}\left(y_t - x_t\right)^2 + \lambda \sum_{t=2}^{n}\left(x_t - x_{t-1}\right)^2$$
In Figure 1.3, we report the corresponding results⁷. For the second test, we consider a mean-reverting (Ornstein-Uhlenbeck) process whose mean value follows a regime-switching process:

$$\begin{cases} y_t = y_{t-1} + \eta\left(x_t - y_{t-1}\right) + \varepsilon_t, & \varepsilon_t \sim \mathcal{N}\left(0, \sigma^2\right) \\ \Pr\left(x_t = x_{t-1}\right) = p \\ \Pr\left(x_t = b\left(U_{[0,1]} - \tfrac{1}{2}\right)\right) = 1 - p \end{cases} \qquad (1.6)$$
Here, $x_t$ is the process which characterizes the mean value and $\eta$ is inversely proportional to the return time to the mean value. In Figure 1.4, we show how the $L_1$-C filter captures the original signal in comparison to the HP filter⁸.
⁶ We use the term HP filter in order to keep homogeneous notations. However, we notice that this filter is indeed the FLS filter proposed by Kalaba and Tesfatsion (1989) when the exogenous regressors reduce to a constant.
⁷ The parameters are $p = 0.998$, $b = 50$ and $\sigma = 8$.
⁸ For the simulation of the Ornstein-Uhlenbeck process, we have chosen $p = 0.9985$, $b = 20$, $\eta = 0.1$ and $\sigma = 2$.
Figure 1.3: $L_1$-C filtering versus HP filtering for the model (1.5) (panels: Signal, Noisy signal, $L_1$-C filter, HP filter)
Figure 1.4: $L_1$-C filtering versus HP filtering for the model (1.6) (panels: Signal, Noisy signal, $L_1$-C filter, HP filter)
1.3.3 Mixing trend and mean-reverting properties

We now combine the two schemes proposed above. In this case, we define two regularization parameters $\lambda_1$ and $\lambda_2$ corresponding to the two penalty conditions $\sum_{t=2}^{n}\left|x_t - x_{t-1}\right|$ and $\sum_{t=2}^{n-1}\left|x_{t-1} - 2x_t + x_{t+1}\right|$. The objective function of the primal problem now becomes:

$$\frac{1}{2}\sum_{t=1}^{n}\left(y_t - x_t\right)^2 + \lambda_1 \sum_{t=2}^{n}\left|x_t - x_{t-1}\right| + \lambda_2 \sum_{t=2}^{n-1}\left|x_{t-1} - 2x_t + x_{t+1}\right|$$

which can again be rewritten in matrix form:

$$\frac{1}{2}\left\|y - x\right\|_2^2 + \lambda_1\left\|D_1 x\right\|_1 + \lambda_2\left\|D_2 x\right\|_1$$

where the $D_1$ and $D_2$ operators are respectively the $(n-1)\times n$ and $(n-2)\times n$ matrices defined in equations (1.4) and (1.1).
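The mixed objective can be prototyped in the same way as the pure $L_1$-T problem; the following sketch (again with a generic convex solver, not the thesis code) simply adds the two $L_1$ penalties with their own regularization parameters.

```python
# Sketch of the mixed L1-TC objective: L1 penalties on both the first
# differences (D1) and the second differences (D2) of the filtered signal.
import cvxpy as cp

def l1_tc_filter(y, lam1, lam2):
    x = cp.Variable(len(y))
    objective = (0.5 * cp.sum_squares(y - x)
                 + lam1 * cp.norm1(cp.diff(x, 1))    # mean-reverting (C) penalty
                 + lam2 * cp.norm1(cp.diff(x, 2)))   # trend (T) penalty
    cp.Problem(cp.Minimize(objective)).solve()
    return x.value
```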
In Figures 1.5 and 1.6, we test the efficiency of the mixed scheme on the straight trend lines model (1.2) and on the random walk model (1.3)⁹.
Figure 1.5: $L_1$-TC filtering versus HP filtering for the model (1.2) (panels: Signal, Noisy signal, $L_1$-TC filter, HP filter)
1.3.4 How to calibrate the regularization parameters?

As shown above, the trend obtained from $L_1$ filtering depends on the regularization parameter $\lambda$.

⁹ For both models, the parameters are $p = 0.99$, $b = 0.5$ and $\sigma = 5$.
Figure 1.6: $L_1$-TC filtering versus HP filtering for the model (1.3) (panels: Signal, Noisy signal, $L_1$-TC filter, HP filter)
For large values of $\lambda$, we obtain the long-term trend of the data, while for small values of $\lambda$, we obtain its short-term trends. In this paragraph, we attempt to define a procedure which makes the right choice of the smoothing parameter according to our need for trend extraction.
A preliminary remark
For small values of $\lambda$, we recover the original form of the signal. For large values of $\lambda$, we remark that there exists a maximum value $\lambda_{\max}$ above which the trend signal has the affine form:

$$x_t = \alpha + \beta t$$

where $\alpha$ and $\beta$ are two constants which do not depend on the time $t$. The value of $\lambda_{\max}$ is given by:

$$\lambda_{\max} = \left\|\left(DD^\top\right)^{-1} D y\right\|_\infty$$
We can use this remark to get an idea of the order of magnitude of the $\lambda$ which should be used to determine the trend over a certain time period $T$. To illustrate this idea, we take the data over the total period $T$. If we want the global trend over this period, we fix $\lambda = \lambda_{\max}$: this gives a unique trend for the signal over the whole period. If one needs more detail on the trend over shorter periods, one can divide the signal into $p$ time intervals and then estimate $\lambda$ via the mean value of the $\lambda^i_{\max}$ parameters:

$$\lambda = \frac{1}{p}\sum_{i=1}^{p}\lambda^i_{\max}$$
In Figure 1.7, we show the results obtained with $p = 2$ ($\lambda = 1\,500$) and $p = 6$ ($\lambda = 75$) on the S&P 500 index.
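The closed-form expression of $\lambda_{\max}$ and the averaging heuristic over $p$ sub-intervals translate directly into a few lines of code; the sketch below is an illustration with our own helper names.

```python
# Sketch: lambda_max = ||(D D')^{-1} D y||_inf, and the heuristic of averaging
# lambda_max over p sub-intervals when shorter trends are targeted.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def lambda_max(y, order=2):
    """Smallest lambda above which the L1 filter returns a single affine/constant trend."""
    n = len(y)
    e = np.ones(n)
    if order == 2:
        D = sp.spdiags([e, -2 * e, e], [0, 1, 2], n - 2, n)   # trend (T) scheme
    else:
        D = sp.spdiags([-e, e], [0, 1], n - 1, n)             # mean-reverting (C) scheme
    return np.abs(spla.spsolve((D @ D.T).tocsc(), D @ y)).max()

def lambda_from_subintervals(y, p):
    """Average lambda_max over p equal sub-periods (heuristic of the text)."""
    chunks = np.array_split(np.asarray(y, dtype=float), p)
    return float(np.mean([lambda_max(c) for c in chunks]))
```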
Figure 1.7: Influence of the smoothing parameter $\lambda$ (S&P 500 index, shown with $\lambda = 999$ and $\lambda = 15$)
Moreover, an explicit calculation for a Brownian motion process gives us the scaling law of the smoothing parameter $\lambda_{\max}$. For the trend filtering scheme, $\lambda_{\max}$ scales as $T^{5/2}$, while for the mean-reverting scheme, $\lambda_{\max}$ scales as $T^{3/2}$ (see Figure 1.8). Numerical estimation of these powers over 500 simulations of the model (1.3) gives very good agreement with the analytical result for Brownian motion: we obtain empirically that the power for the $L_1$-T filter is 2.51, while the one for the $L_1$-C filter is 1.52.
Cross validation procedure
In this paragraph, we discuss how to employ a cross-validation scheme in order to calibrate the smoothing parameter $\lambda$ of our model. We define two additional parameters which characterize the trend detection mechanism. The first parameter $T_1$ is the width of the data window used to estimate the optimal $\lambda$ with respect to our target strategy; this parameter controls the precision of our calibration. The second parameter $T_2$ is used to estimate the prediction error of the trends obtained in the main window; this parameter characterizes the time horizon of the investment strategy.

Figure 1.8: Scaling power law of the smoothing parameter $\lambda_{\max}$

Figure 1.9: Cross-validation procedure for determining the optimal value $\lambda^*$ (the historical data are divided into a training set of width $T_1$ and a test set of width $T_2$; a window of width $T_2$ after today is used for forecasting)

Figure 1.9 shows how the data set is divided into different windows in the
cross-validation procedure. In order to get the optimal parameter $\lambda$, we compute the total error after scanning the whole data set with the window $T_1$. The algorithm of this calibration process is described as follows:
Algorithm 1 Cross-validation procedure for $L_1$ filtering
procedure CV_Filter($T_1$, $T_2$)
    Divide the historical data into $m$ rolling test sets $T_2^i$ ($i = 1, \dots, m$)
    For each test window $T_2^i$, compute the statistic $\lambda^i_{\max}$
    From the array $\{\lambda^i_{\max}\}$, compute the average $\bar\lambda$ and the standard deviation $\sigma_\lambda$
    Compute the boundaries $\lambda_1 = \bar\lambda - 2\sigma_\lambda$ and $\lambda_2 = \bar\lambda + 2\sigma_\lambda$
    for $j = 1 : n$ do
        Compute $\lambda_j = \lambda_1 \left(\lambda_2 / \lambda_1\right)^{j/n}$
        Divide the historical data into $p$ rolling training sets $T_1^k$ ($k = 1, \dots, p$)
        for $k = 1 : p$ do
            For each training window $T_1^k$, run the $L_1$ filter
            Forecast the trend for the adjacent test window $T_2^k$
            Compute the error $e_k(\lambda_j)$ on the test window $T_2^k$
        end for
        Compute the total error $e(\lambda_j) = \sum_{k=1}^{p} e_k(\lambda_j)$
    end for
    Minimize the total error $e(\lambda)$ to find the optimal value $\lambda^*$
    Run the $L_1$ filter with $\lambda = \lambda^*$
end procedure
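A rough translation of Algorithm 1 is sketched below. The error measure and the way a trend is extrapolated on the test window are only described schematically in the text, so the squared forecast error and the last-fitted-slope extrapolation used here are assumptions; `l1_trend_filter` and `lambda_max` refer to the sketches given earlier.

```python
# Rough sketch of Algorithm 1 (cross-validation of lambda). Assumptions: the
# forecast extrapolates the last fitted slope of each training window, and the
# prediction error is the sum of squared deviations on the adjacent test window.
import numpy as np

def cv_filter(y, T1, T2, n_grid=15):
    y = np.asarray(y, dtype=float)
    # 1) grid of candidate lambdas from lambda_max statistics on rolling test windows
    starts = range(0, len(y) - T2 + 1, T2)
    lmax = np.array([lambda_max(y[s:s + T2]) for s in starts])
    lo = max(lmax.mean() - 2 * lmax.std(), 1e-6)
    hi = lmax.mean() + 2 * lmax.std()
    grid = lo * (hi / lo) ** (np.arange(1, n_grid + 1) / n_grid)

    # 2) rolling train/test error for each candidate lambda
    def total_error(lam):
        err = 0.0
        for s in range(0, len(y) - T1 - T2 + 1, T2):
            x = l1_trend_filter(y[s:s + T1], lam)
            slope = x[-1] - x[-2]                        # last fitted slope
            forecast = x[-1] + slope * np.arange(1, T2 + 1)
            err += np.sum((y[s + T1:s + T1 + T2] - forecast) ** 2)
        return err

    errors = [total_error(lam) for lam in grid]
    return grid[int(np.argmin(errors))]                  # optimal lambda*
```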
Figure 1.10 illustrates the calibration procedure for the S&P 500 index with $T_1 = 400$ and $T_2 = 50$ (the number of observations is equal to 1 008 trading days). With $m = p = 12$ and $n = 15$, the estimated optimal value $\lambda^*$ for the $L_1$-T filter is equal to 7.03.
We have observed that this calibration procedure is better suited to long-term time horizons, that is, to estimating a global trend. For short-term time horizons, the prediction of local trends is much more perturbed by the noise. We have computed the probability of correctly predicting the tendency of the market for long-term and short-term time horizons: this probability is about 70% for a 3-month horizon, while it is only 50% for a one-week horizon. It follows that, even if the fit is good in-sample, the noise is so large that the prediction of the future tendency is just 1/2 for an increasing market and 1/2 for a decreasing market. In order to obtain better results for smaller time horizons, we improve the last algorithm by proposing a two-trend model. The first trend is the local one, determined by the first algorithm with the parameter $T_2$ corresponding to the local prediction. The second trend is the global one, which gives the tendency of the market over a longer period $T_3$. The choice of this global trend parameter is very similar to the choice of the moving-average parameter. This model can be considered as a simple version of a mean-reverting model for the trend. In Figure 1.11, we describe how the data set is divided for estimating the local trend and the global trend.
The procedure for estimating the trend of the signal in the two-trend model is summarized in Algorithm 2. The corrected trend is determined by studying the relative position of the historical data with respect to the global trend. The reference position is characterized by the standard deviation $\sigma\left(y_t - x^G_t\right)$, where $x^G_t$ is the filtered global trend.
Figure 1.10: Calibration procedure with the S&P 500 index (top: total error $\ln e(\lambda)$; bottom: S&P 500 index, 2007-2011)
1.4 Application to momentum strategies

In this section, we apply the previous framework to the S&P 500 index. First, we illustrate the calibration procedure for a given trading date. Then, we backtest a momentum strategy by estimating the optimal filters dynamically.

1.4.1 Estimating the optimal filter for a given trading date

We would like to estimate the optimal filter for January 3rd, 2011 by considering the period from January 2007 to December 2010. We use the previous algorithms
Figure 1.11: Cross-validation procedure for the two-trend model (the historical data are divided into a training set of width $T_1$ and test sets of widths $T_2$ and $T_3$; forecasting windows of widths $T_2$ and $T_3$ after today correspond to the local and global trends)
Algorithm 2 Prediction procedure for the two-trend model
procedure Predict_Filter($T_l$, $T_g$)
    Compute the local trend $x^L_t$ for the time horizon $T_2$ with the CV_Filter procedure
    Compute the global trend $x^G_t$ for the time horizon $T_3$ with the CV_Filter procedure
    Compute the standard deviation $\sigma\left(y_t - x^G_t\right)$ of the data with respect to the global trend
    if $\left|y_t - x^G_t\right| < \sigma\left(y_t - x^G_t\right)$ then
        Prediction $\leftarrow x^L_t$
    else
        Prediction $\leftarrow x^G_t$
    end if
end procedure
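Algorithm 2 can be sketched as follows, reusing the `cv_filter` and `l1_trend_filter` helpers above; the window lengths passed to the global-trend calibration are illustrative assumptions.

```python
# Sketch of Algorithm 2 (two-trend prediction): switch between the local and
# the global filtered trend according to the distance of the data to the
# global trend. cv_filter and l1_trend_filter are the earlier sketches.
import numpy as np

def predict_two_trend(y, T1, T2, T3):
    y = np.asarray(y, dtype=float)
    lam_local = cv_filter(y, T1, T2)
    lam_global = cv_filter(y, 4 * T1, T3)      # assumption: larger calibration window
    x_local = l1_trend_filter(y, lam_local)
    x_global = l1_trend_filter(y, lam_global)
    band = np.std(y - x_global)                # reference deviation to the global trend
    if abs(y[-1] - x_global[-1]) < band:
        return x_local                          # keep the local trend
    return x_global                             # otherwise revert to the global trend
```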
with $T_1 = 400$ and $T_2 = 50$. The optimal parameters are $\lambda_1 = 2.46$ (for the $L_1$-C filter) and $\lambda_2 = 15.94$ (for the $L_1$-T filter). Results are reported in Figure 1.12. The trend for the next 50 trading days is estimated at 7.34% for the $L_1$-T filter and 7.84% for the HP filter, whereas it is null for the $L_1$-C and $L_1$-TC filters. By comparison, the true performance of the S&P 500 index is 1.90% from January 3rd, 2011 to March 15th, 2011¹⁰.
Figure 1.12: Comparison between different $L_1$ filters on the S&P 500 index

¹⁰ It corresponds exactly to a period of 50 trading days.
1.4.2 Backtest of a momentum strategy

Design of the strategy

Let us consider a class of self-financed strategies on a risky asset $S_t$ and a risk-free asset $B_t$. We assume that the dynamics of these assets are:

$$\mathrm{d}B_t = r_t B_t \,\mathrm{d}t$$
$$\mathrm{d}S_t = \mu_t S_t \,\mathrm{d}t + \sigma_t S_t \,\mathrm{d}W_t$$

where $r_t$ is the risk-free rate, $\mu_t$ is the trend of the asset price and $\sigma_t$ is the volatility. We denote by $\alpha_t$ the proportion of wealth invested in the risky asset and by $(1 - \alpha_t)$ the part invested in the risk-free asset. We start with an initial budget $W_0$ and expect a final wealth $W_T$. The optimal strategy is the one which maximizes the expectation of a utility function $U(W_T)$, which is increasing and concave. It is equivalent to the Markowitz problem, which consists of maximizing the wealth of the portfolio under a penalty of risk:

$$\sup_{\alpha \in \mathbb{R}} \left\{ \mathbb{E}\left(W^\alpha_T\right) - \frac{\gamma}{2}\,\sigma^2\left(W^\alpha_T\right) \right\}$$
which is equivalent to:

$$\sup_{\alpha \in \mathbb{R}} \left\{ \alpha_t \mu_t - \frac{\gamma}{2}\, W_0\, \alpha_t^2 \sigma^2_t \right\}$$

As the objective function is concave, the maximum corresponds to the zero of the gradient $\mu_t - \gamma W_0 \alpha_t \sigma^2_t$. We obtain the optimal solution:

$$\alpha^\star_t = \frac{\mu_t}{\gamma W_0 \sigma^2_t}$$

In order to limit the explosion of $\alpha_t$, we also impose the constraint $\alpha_{\min} \leq \alpha_t \leq \alpha_{\max}$:

$$\alpha_t = \max\left(\min\left(\frac{\mu_t}{\gamma W_0 \sigma^2_t},\ \alpha_{\max}\right),\ \alpha_{\min}\right)$$
The wealth of the portfolio is then given by the following expression:

$$W_{t+1} = W_t + W_t\left[\alpha_t\left(\frac{S_{t+1}}{S_t} - 1\right) + \left(1 - \alpha_t\right) r_t\right]$$
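The resulting backtest loop is straightforward; the sketch below (our own notation, not the thesis code) plugs estimated trends and volatilities into the capped exposure rule and iterates the wealth equation.

```python
# Minimal sketch of the momentum backtest loop: exposure proportional to the
# estimated trend over variance, capped between alpha_min and alpha_max.
# mu_hat and sigma_hat are the trend/volatility estimators of this chapter.
import numpy as np

def backtest(prices, mu_hat, sigma_hat, r=0.0, gamma=1.0, W0=1.0,
             alpha_min=-1.0, alpha_max=1.0):
    W = np.empty(len(prices))
    W[0] = W0
    for t in range(len(prices) - 1):
        alpha = mu_hat[t] / (gamma * W0 * sigma_hat[t] ** 2)
        alpha = np.clip(alpha, alpha_min, alpha_max)
        asset_ret = prices[t + 1] / prices[t] - 1.0
        W[t + 1] = W[t] * (1.0 + alpha * asset_ret + (1.0 - alpha) * r)
    return W
```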
Results

In the following simulations, we use the estimators $\hat\mu_t$ and $\hat\sigma_t$ in place of $\mu_t$ and $\sigma_t$. For $\hat\mu_t$, we consider different models such as the $L_1$, HP and moving-average filters¹¹, whereas we use the following estimator for the volatility:

$$\hat\sigma^2_t = \frac{1}{T}\int_0^T \sigma^2_t \,\mathrm{d}t = \frac{1}{T}\sum_{i=t-T+1}^{t} \ln^2\frac{S_i}{S_{i-1}}$$
We consider a long/short strategy, that is $(\alpha_{\min}, \alpha_{\max}) = (-1, 1)$. In the particular case of the $\hat\mu^{L_1}_t$ estimator, we consider three different models:

¹¹ We denote them respectively $\hat\mu^{L_1}_t$, $\hat\mu^{HP}_t$ and $\hat\mu^{MA}_t$.
Table 1.1: Results for the Backtest

Model               Trend   Performance   Volatility   Sharpe   IR     Drawdown
S&P 500                     2.04%         21.83%       0.06            56.78
$\hat\mu^{MA}_t$            3.13%         18.27%       0.01     0.03   33.83
$\hat\mu^{HP}_t$            6.39%         18.28%       0.17     0.13   39.60
$\hat\mu^{L_1}_t$   (LT)    3.17%         17.55%       0.01     0.03   25.11
$\hat\mu^{L_1}_t$   (GT)    6.95%         19.01%       0.19     0.14   31.02
$\hat\mu^{L_1}_t$   (LGT)   6.47%         18.18%       0.17     0.13   31.99
1. the first one is based on the local trend;
2. the second one is based on the global trend;
3. the third one combines both the local and the global trends.
For all these strategies, the test set of the local trend $T_2$ is equal to 6 months (or 130 trading days), whereas the length of the test set for the global trend is four times longer, $T_3 = 4T_2$, meaning that $T_3$ is two years (or 520 trading days). This choice of $T_3$ agrees with the usual choice of the window width in the moving-average estimator. The length of the training set $T_1$ is also four times the length of the test set. The study period is from January 1998 to December 2010. In the backtest, the trend estimation is updated every day. In Table 1.1, we summarize the results obtained with the different models cited above. We remark that the best performances correspond to the global trend, HP and two-trend models. Because the HP filter is calibrated to the window of the moving-average filter, which is equal to $T_3$, it is not surprising that the performances of these three models are similar. Over the considered backtest period, the S&P 500 does not have a clear upward or downward trend. Hence, the local trend estimator does not give a good prediction and this strategy shows the worst performance. By contrast, the two-trend model takes into account the trade-off between the local trend and the global trend and gives a better result.
1.5 Extension to the multivariate case

We now extend the $L_1$ filtering scheme to a multivariate time series $y_t = \left(y^{(1)}_t, \dots, y^{(m)}_t\right)$. The underlying idea is to estimate the common trend of several univariate time series. In finance, the time series correspond to the prices of several assets. Therefore, we can build long/short strategies between these assets by comparing the individual trends and the common trend.

For the sake of simplicity, we assume that all the signals are rescaled to the same order of magnitude¹². The objective function becomes:

$$\frac{1}{2}\sum_{i=1}^{m}\left\|y^{(i)} - x\right\|_2^2 + \lambda\left\|Dx\right\|_1$$

In Appendix A.1.1, we show that this problem is equivalent to the univariate $L_1$ problem applied to the average signal $\bar y_t = m^{-1}\sum_{i=1}^{m} y^{(i)}_t$.
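In code, this reduction means that the multivariate filter is a one-liner on top of the univariate routine; the standardization mentioned in the footnote is included in the sketch below (an illustration, not the thesis implementation).

```python
# Sketch: the multivariate L1 filter reduces to filtering the cross-sectional
# average of the (standardized) series with the univariate routine above.
import numpy as np

def common_trend(Y, lam):
    """Y: array of shape (T, m) of price levels; returns the common filtered trend."""
    Z = (Y - Y.mean(axis=0)) / Y.std(axis=0)   # rescale each series (footnote 12)
    y_bar = Z.mean(axis=1)                      # average signal
    return l1_trend_filter(y_bar, lam)          # univariate L1-T filter (earlier sketch)
```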
1.6 Conclusion

Momentum strategies are efficient ways to exploit the market tendency in building trading strategies. Hence, a good estimator of the trend is essential from this perspective. In this chapter, we show that $L_1$ filters can be used to forecast the trend of the market in a very simple way. We also propose a cross-validation procedure to calibrate the optimal regularization parameter, where the only information to provide is the investment time horizon. More sophisticated models based on local and global trends are also discussed. We remark that these models can reflect the effect of mean reversion towards the global trend of the market. Finally, we consider several backtests on the S&P 500 index and obtain competitive results with respect to the traditional moving-average filter.

¹² For example, we may center and standardize the time series by subtracting the mean and dividing by the standard deviation.
Bibliography

[1] Alexandrov T., Bianconcini S., Dagum E.B., Maass P. and McElroy T. (2008), A Review of Some Modern Approaches to the Problem of Trend Extraction, US Census Bureau, RRS #2008/03.
[2] Beveridge S. and Nelson C.R. (1981), A New Approach to the Decomposition of Economic Time Series into Permanent and Transitory Components with Particular Attention to Measurement of the Business Cycle, Journal of Monetary Economics, 7(2), pp. 151-174.
[3] Boyd S. and Vandenberghe L. (2009), Convex Optimization, Cambridge University Press.
[4] Cleveland W.P. and Tiao G.C. (1976), Decomposition of Seasonal Time Series: A Model for the Census X-11 Program, Journal of the American Statistical Association, 71(355), pp. 581-587.
[5] Daubechies I., Defrise M. and De Mol C. (2004), An Iterative Thresholding Algorithm for Linear Inverse Problems with a Sparsity Constraint, Communications on Pure and Applied Mathematics, 57(11), pp. 1413-1457.
[6] Efron B., Tibshirani R. and Friedman R. (2009), The Elements of Statistical Learning, Second Edition, Springer.
[7] Harvey A. (1991), Forecasting, Structural Time Series Models and the Kalman Filter, Cambridge University Press.
[8] Hodrick R.J. and Prescott E.C. (1997), Postwar U.S. Business Cycles: An Empirical Investigation, Journal of Money, Credit and Banking, 29(1), pp. 1-16.
[9] Kalaba R. and Tesfatsion L. (1989), Time-varying Linear Regression via Flexible Least Squares, Computers & Mathematics with Applications, 17, pp. 1215-1245.
[10] Kim S-J., Koh K., Boyd S. and Gorinevsky D. (2009), $\ell_1$ Trend Filtering, SIAM Review, 51(2), pp. 339-360.
[11] Tibshirani R. (1996), Regression Shrinkage and Selection via the Lasso, Journal of the Royal Statistical Society B, 58(1), pp. 267-288.
Chapter 2

Volatility Estimation for Trading Strategies

In this chapter, we review various techniques for estimating the volatility. We start by discussing estimators based on the range of daily monitoring data, then we consider the stochastic volatility model in order to determine the instantaneous volatility. At high trading frequencies, stock prices fluctuate with an additional noise, the so-called microstructure noise. This effect comes from the bid-ask bounce at short time scales: within a short time interval, the trading price does not converge to the equilibrium price determined by the supply-demand equilibrium. In the second part of the chapter, we discuss the effect of the microstructure noise on the volatility estimation, a very important topic for the enormous field of high-frequency trading. Examples of backtests on an index and on single stocks illustrate the efficiency of the considered techniques.

Keywords: Volatility, voltarget strategy, range-based estimator, high-low estimator, microstructure noise.
2.1 Introduction

Measuring the volatility is one of the most important questions in finance. As its name indicates, volatility is a direct measurement of the risk of a given asset. Under the hypothesis that the realized return follows a Brownian motion, volatility is usually estimated by the standard deviation of daily price movements. As this assumption relates the stock price to the most common object of stochastic calculus, much mathematical work has been carried out on volatility estimation. With the increasing amount of trading data, we can exploit more and more useful information in order to improve the precision of the volatility estimator. A new class of estimators based on the high and low prices was invented. However, in the real world the asset price is not a simple geometric Brownian process; different effects have been observed, including the drift and the opening jump. A general correction scheme based on the combination of various estimators has been studied in order to eliminate these effects.

As the trading frequency increases, we expect the precision of the estimator to improve as well. However, when the trading frequency reaches a certain limit¹, new phenomena due to the non-equilibrium of the market emerge and spoil the precision. This is the so-called microstructure noise, characterized by the bid-ask bounce and the transaction effect. Because of this noise, the realized variance estimator overestimates the true volatility of the price process. A correction based on the use of two different time scales aims at eliminating this effect.

The chapter is organized as follows. In Section 2.2, we review the basic volatility estimator using the variance of realized returns (following a note by B. Bruder), then we introduce the variations based on range estimation. In Section 2.3, we discuss how to measure the instantaneous volatility and the lag effect induced by the moving average. In Section 2.4, we discuss the effect of the microstructure noise on high-frequency volatility estimation.
2.2 Range-based estimators of volatility

2.2.1 Range-based daily data

In this paragraph, we discuss the general characteristics of the asset price and introduce the basic notations which will be used in the rest of the chapter. Let us assume that the dynamics of the asset price follow the usual Black-Scholes model: we denote by $S_t$ the asset price, which follows a geometric Brownian motion in continuous time:

$$\frac{\mathrm{d}S_t}{S_t} = \mu_t\,\mathrm{d}t + \sigma_t\,\mathrm{d}B_t \qquad (2.1)$$

Here, $\mu_t$ is the return (or drift) of the process, whereas $\sigma_t$ is the volatility. Over a period of $T = 1$ trading day, the evolution is divided into two time intervals: the first interval, of fraction $f$, describes the closing interval (before the opening), and the second interval, of fraction $1 - f$, describes the opening interval (the trading interval). In the monitoring of the data, the closing interval is unobservable and is characterized by the jump at the opening of the market. The measure of the closing interval is therefore given not by the real closing time but by the jumps at the opening of the market. If the logarithm of the price follows a standard Brownian motion without drift, then the fraction $f/(1-f)$ is given by the squared ratio between the standard deviation of the opening jump and that of the daily price movement. We will see that this idea provides a first correction for the close-open effect for all the estimators discussed below.

In order to fix the notation, we define here the different quantities concerning the statistics of the price evolution:

- $T$ is the time interval of 1 trading day
¹ This limit defines the optimal frequency for the classical estimator. It is more or less agreed to be one trade every 5 minutes.
Figure 2.1: Data set of 1 trading day
- $f$ is the fraction of the closing period
- $\hat\sigma^2_t$ is the estimator of the variance $\sigma^2_t$
- $O_{t_i}$ is the opening price on a given period $[t_i, t_{i+1})$
- $C_{t_i}$ is the closing price on a given period $[t_i, t_{i+1})$
- $H_{t_i} = \max_{t\in[t_i,t_{i+1})} S_t$ is the highest price on a given period $[t_i, t_{i+1})$
- $L_{t_i} = \min_{t\in[t_i,t_{i+1})} S_t$ is the lowest price on a given period $[t_i, t_{i+1})$
- $o_{t_i} = \ln O_{t_i} - \ln C_{t_{i-1}}$ is the opening jump
- $u_{t_i} = \ln H_{t_i} - \ln O_{t_i}$ is the highest price movement during the trading session
- $d_{t_i} = \ln L_{t_i} - \ln O_{t_i}$ is the lowest price movement during the trading session
- $c_{t_i} = \ln C_{t_i} - \ln O_{t_i}$ is the daily price movement over the open trading period
2.2.2 Basic estimator

For the sake of simplicity, let us start this paragraph by assuming that there is no opening jump ($f = 0$). The asset price $S_t$ described by the process (2.1) is observed at a series of discrete dates $t_0, \dots, t_n$. In general, this series is not necessarily regular. Let $R_{t_i}$ be the realized return over the period $[t_{i-1}, t_i)$; then we obtain:

$$R_{t_i} = \ln S_{t_i} - \ln S_{t_{i-1}} = \int_{t_{i-1}}^{t_i}\left(\sigma_u\,\mathrm{d}B_u + \mu_u\,\mathrm{d}u - \frac{1}{2}\sigma^2_u\,\mathrm{d}u\right)$$

In the following, we assume that the couple $(\mu_t, \sigma_t)$ is independent of the Brownian motion $B_t$ driving the asset price evolution.
Estimator over a given period
In Appendix B.1, we show that the realized return $R_{t_i}$ is related to the volatility by:

$$\mathbb{E}\left[R^2_{t_i}\mid\mu,\sigma\right] = \left(t_i - t_{i-1}\right)\sigma^2_{t_{i-1}} + \left(t_i - t_{i-1}\right)^2\left(\mu_{t_{i-1}} - \frac{1}{2}\sigma^2_{t_{i-1}}\right)^2$$

This quantity cannot be a good estimator of the volatility because its standard deviation is $\sqrt{2}\left(t_i - t_{i-1}\right)\sigma^2_{t_{i-1}}$, which is proportional to the estimated quantity itself. In order to reduce the estimation error, we focus on the estimation of the average volatility over the period $t_n - t_0$. The average volatility is defined as:

$$\bar\sigma^2 = \frac{1}{t_n - t_0}\int_{t_0}^{t_n}\sigma^2_u\,\mathrm{d}u \qquad (2.2)$$
This quantity can be measured using the canonical estimator defined as:

$$\hat{\bar\sigma}^2 = \frac{1}{t_n - t_0}\sum_{i=1}^{n} R^2_{t_i}$$

The variance of this estimator is approximately $\operatorname{var}\left(\hat{\bar\sigma}^2\right) \simeq 2\bar\sigma^4/n$, so its standard deviation is proportional to $\sqrt{2}\,\bar\sigma^2/\sqrt{n}$. This means that the estimation error is small if $n$ is large enough. Indeed, the variance of the average volatility itself reads $\operatorname{var}\left(\hat{\bar\sigma}\right) \simeq \bar\sigma^2/(2n)$ and its standard deviation is approximately $\bar\sigma/\sqrt{2n}$.
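The canonical estimator of equation (2.2) can be computed directly from a price series; the sketch below (our own helper, not the thesis code) assumes the observation dates are expressed in year fractions so that the output is an annualized variance.

```python
# Sketch of the canonical realized-variance estimator of equation (2.2)
# for a series of (possibly irregular) observation dates measured in years.
import numpy as np

def canonical_variance(prices, times):
    """prices: array of prices observed at dates `times` (in year fractions)."""
    r = np.diff(np.log(prices))                 # realized log-returns R_{t_i}
    return np.sum(r ** 2) / (times[-1] - times[0])

# annualized volatility estimate:
# sigma_hat = np.sqrt(canonical_variance(prices, times))
```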
Effect of the weight distribution

In general, we can define an estimator with a weight distribution $w_i$ as:

$$\hat{\bar\sigma}^2 = \sum_{i=1}^{n} w_i R^2_{t_i}$$

Then the expected value of the estimator is given by:

$$\mathbb{E}\left[\hat{\bar\sigma}^2\mid\mu,\sigma\right] = \sum_{i=1}^{n} w_i \int_{t_{i-1}}^{t_i}\sigma^2_u\,\mathrm{d}u$$

A simple example of this general definition is the estimator with annualized returns $R_{t_i}/\sqrt{t_i - t_{i-1}}$. In this case, our estimator becomes:

$$\hat{\bar\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}\frac{R^2_{t_i}}{t_i - t_{i-1}}$$

for which the expected value is:

$$\mathbb{E}\left[\hat{\bar\sigma}^2\mid\mu,\sigma\right] = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{t_i - t_{i-1}}\int_{t_{i-1}}^{t_i}\sigma^2_u\,\mathrm{d}u \qquad (2.3)$$

We remark that if the time step (time increment) is constant, $t_i - t_{i-1} = T$, then we obtain the same result as with the canonical estimator. However, if the time step $t_i - t_{i-1}$ is not constant, the long-term return is underweighted while the short-term return is overweighted. We will see in the discussion of the realized volatility that the choice of the weight distribution can help to improve the quality of the estimator. For example, we will show that the IGARCH estimation leads to an exponential weight distribution which is more appropriate for estimating the realized volatility.
Close-to-close and open-to-close estimators

As discussed above, the volatility can be obtained by using a moving average on discrete data. The standard measurement is to employ the above result of the canonical estimator for the closing prices (the so-called close-to-close estimator):

$$\hat{\sigma}^2_{CC} = \frac{1}{(n-1)\, T} \sum_{i=1}^{n} \left( (o_{t_i} + c_{t_i}) - (\bar{o} + \bar{c}) \right)^2$$

Here, $T$ is the time period corresponding to one trading day. In the rest of the paper, we use CC to denote the close-to-close estimator. We remark that this formula differs in two respects from the one defined above. Firstly, we have subtracted the mean value of the close-to-close return $(\bar{o} + \bar{c})$ in order to eliminate the drift effect:

$$\bar{o} = \frac{1}{nT} \sum_{i=1}^{n} o_{t_i}, \qquad \bar{c} = \frac{1}{nT} \sum_{i=1}^{n} c_{t_i}$$

Secondly, the prefactor is now $1/(n-1)T$ and not $1/nT$: since we have subtracted the mean value, the maximum likelihood procedure leads to the factor $1/(n-1)T$.

We can also define two other volatility estimators, namely the open-to-close estimator (OC):

$$\hat{\sigma}^2_{C} = \frac{1}{(n-1)\, T} \sum_{i=1}^{n} \left( c_{t_i} - \bar{c} \right)^2$$

and the close-to-open estimator (CO):

$$\hat{\sigma}^2_{O} = \frac{1}{(n-1)\, T} \sum_{i=1}^{n} \left( o_{t_i} - \bar{o} \right)^2$$

We remind that $o_{t_i}$ is the opening jump for a given trading period and $c_{t_i}$ is the daily movement of the asset price, so that the close-to-close return is equal to $(o_{t_i} + c_{t_i})$.

We remark that the close-to-close estimator depends neither on the drift nor on the closing interval $f$. In the absence of microstructure noise, this estimator is unbiased. Hence, it is usually used as a benchmark to judge the efficiency of other estimators, which is defined as:

$$e\left( \hat{\sigma}^2 \right) = \frac{\mathrm{var}\left( \hat{\sigma}^2_{CC} \right)}{\mathrm{var}\left( \hat{\sigma}^2 \right)}$$

where $\mathrm{var}\left( \hat{\sigma}^2_{CC} \right) = 2\sigma^4/n$. The quality of an estimator is indicated by an efficiency larger than one, $e\left( \hat{\sigma}^2 \right) > 1$.
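As an illustration, a minimal Python sketch of the CC, OC and CO estimators could look as follows. The function name and the annualization convention $T = 1/252$ are our own assumptions:

```python
import numpy as np

T = 1.0 / 252.0  # one trading day in years (assumed annualization convention)

def cc_oc_co(open_prices, close_prices):
    """Close-to-close, open-to-close and close-to-open variance estimators."""
    o = np.log(open_prices[1:]) - np.log(close_prices[:-1])   # opening jumps o_{t_i}
    c = np.log(close_prices[1:]) - np.log(open_prices[1:])    # open-to-close moves c_{t_i}
    n = len(c)
    var_cc = np.sum(((o + c) - (o + c).mean()) ** 2) / ((n - 1) * T)
    var_oc = np.sum((c - c.mean()) ** 2) / ((n - 1) * T)
    var_co = np.sum((o - o.mean()) ** 2) / ((n - 1) * T)
    return var_cc, var_oc, var_co
```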
2.2.3 High-low estimators

We have seen that the daily deviation can be used to define an estimator of the volatility. This comes from the assumption that the logarithm of the price follows a Brownian motion. We know that the standard deviation of a diffusive process over a time interval $\Delta t$ is proportional to $\sqrt{\Delta t}$, hence using the variance to estimate the volatility is quite intuitive. Indeed, within a given time interval, if additional information on the price movement is available, such as the highest or the lowest value, this range must also provide a good measure of the volatility. This idea was first addressed by W. Feller in 1951. Later, Parkinson (1980) employed Feller's first result to propose the first high-low estimator (the so-called Parkinson estimator). If one uses close prices to estimate the volatility, one can eliminate the effect of the drift by subtracting the mean value of the daily variation. By contrast, the use of high and low prices cannot eliminate the drift effect in such a simple way. In addition, the high and low prices can only be observed during the opening interval, so this approach cannot eliminate the second effect, due to the opening jump. Moreover, as demonstrated in the work of Parkinson (1980), this estimator gives a better confidence but it obviously underestimates the volatility because of the discrete observation of the price. The maximum and minimum values over a time interval are not the true ones of the Brownian motion; they are underestimated, so it is not surprising that the result depends strongly on the frequency of the price quotation. In a high-frequency market this third effect can be negligible; we will discuss it later. Because of the limitations of Parkinson's estimator, another estimator, also based on the work of Feller, was proposed by Kunitomo (1992). In order to eliminate the drift, he constructs a Brownian bridge whose range is again related to the diffusion coefficient. In the same line of thought, Rogers and Satchell (1991) propose another use of the high and low prices in order to obtain a drift-independent volatility estimator. In this section, we review these techniques, which all remain constrained by the opening jump.
The Parkinson estimator

Let us consider the random variable $u_{t_i} - d_{t_i}$ (namely the range of the Brownian motion over the period $[t_i, t_{i+1})$). The Parkinson estimator is defined by using the following result (Feller, 1951):

$$\mathbb{E}\left[ (u - d)^2 \right] = (4 \ln 2)\, \sigma^2 T$$

By inverting this formula, we obtain a natural estimator of the volatility based on high and low prices. Parkinson's volatility estimator is then defined as (Parkinson, 1980):

$$\hat{\sigma}^2_{P} = \frac{1}{nT} \sum_{i=1}^{n} \frac{1}{4 \ln 2} \left( u_{t_i} - d_{t_i} \right)^2$$

In order to estimate the error of the estimator, we compute the variance of $\hat{\sigma}^2_P$, which is given by the following expression:

$$\mathrm{var}\left( \hat{\sigma}^2_P \right) = \left( \frac{9\, \zeta(3)}{16 (\ln 2)^2} - 1 \right) \frac{\sigma^4}{n}$$

Here, $\zeta(x)$ is the Riemann zeta function. In comparison to the close-to-close benchmark estimator, we have an efficiency:

$$e\left( \hat{\sigma}^2_P \right) = \frac{32 (\ln 2)^2}{9\, \zeta(3) - 16 (\ln 2)^2} = 4.91$$
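A hedged Python sketch of the Parkinson estimator from daily high and low prices (the function name and the annualization $T = 1/252$ are our own choices) might read:

```python
import numpy as np

def parkinson_variance(high, low, T=1.0/252.0):
    """Parkinson estimator of the annualized variance from daily highs and lows."""
    hl_range = np.log(high) - np.log(low)              # u_{t_i} - d_{t_i}
    n = len(hl_range)
    return np.sum(hl_range ** 2 / (4.0 * np.log(2.0))) / (n * T)
```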
The Garman-Klass estimator

Another idea, which employs the additional information from the high and low values of the price movement within the trading day in order to increase the estimator efficiency, was introduced by Garman and Klass (1980). They construct a best analytic scale-invariant estimator by proposing a quadratic form and imposing the well-known invariance condition of the Brownian motion on the set of variables $(u, d, c)$. By minimizing its variance, they obtain the optimal quadratic estimator, which is given by the following property:

$$\mathbb{E}\left[ 0.511\, (u - d)^2 - 0.019\, \left( c\,(u + d) - 2ud \right) - 0.383\, c^2 \right] = \sigma^2 T$$

The Garman-Klass estimator is then defined as:

$$\hat{\sigma}^2_{GK} = \frac{1}{nT} \sum_{i=1}^{n} \left[ 0.511\, (u_{t_i} - d_{t_i})^2 - 0.019\, \left( c_{t_i} (u_{t_i} + d_{t_i}) - 2 u_{t_i} d_{t_i} \right) - 0.383\, c_{t_i}^2 \right]$$

The minimal variance of this quadratic estimator is $\mathrm{var}\left( \hat{\sigma}^2_{GK} \right) = 0.27\, \sigma^4/n$ and its efficiency is $e\left( \hat{\sigma}^2_{GK} \right) = 7.4$.
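A corresponding sketch of the Garman-Klass estimator from daily open, high, low and close prices (again with our own names and annualization convention) could be:

```python
import numpy as np

def garman_klass_variance(open_, high, low, close, T=1.0/252.0):
    """Garman-Klass estimator of the annualized variance (no opening-jump correction)."""
    u = np.log(high) - np.log(open_)     # highest move over the open period
    d = np.log(low) - np.log(open_)      # lowest move over the open period
    c = np.log(close) - np.log(open_)    # open-to-close move
    n = len(c)
    terms = 0.511 * (u - d) ** 2 - 0.019 * (c * (u + d) - 2.0 * u * d) - 0.383 * c ** 2
    return np.sum(terms) / (n * T)
```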
The Kunitomo estimator

Let $X_t$ be the logarithm of the price process, $X_t = \ln S_t$; Ito's lemma gives its evolution:

$$dX_t = \left( \mu_t - \frac{\sigma_t^2}{2} \right) dt + \sigma_t \, dB_t$$

If the drift term becomes relevant in the estimation of the volatility, one can eliminate it by constructing a Brownian bridge over the period $T$ as follows:

$$W_t = X_t - \frac{t}{T}\, X_T$$

If the initial condition is normalized to $X_0 = 0$, then by definition we always have $W_T = 0$. This construction automatically eliminates the drift term when its daily variation is small, $\mu_{t_{i+1}} - \mu_{t_i} \ll \mu_{t_i}$. We define the range of the Brownian bridge $D_{t_i} = M_{t_i} - m_{t_i}$, where $M_{t_i} = \max_{t \in [t_i, t_{i+1})} W_t$ and $m_{t_i} = \min_{t \in [t_i, t_{i+1})} W_t$. It has been demonstrated that the expectation of the squared range of the Brownian bridge is directly proportional to the variance (Feller, 1951):

$$\mathbb{E}\left[ D^2 \right] = \frac{\pi^2}{6}\, \sigma^2 T \qquad (2.4)$$

Hence, Kunitomo's estimator is defined as follows:

$$\hat{\sigma}^2_{K} = \frac{1}{nT} \sum_{i=1}^{n} \frac{6}{\pi^2} \left( M_{t_i} - m_{t_i} \right)^2$$

Higher moments of the Brownian bridge can also be calculated analytically and are given by formula 2.10 in Kunitomo (1992). In particular, the variance of Kunitomo's estimator is equal to $\mathrm{var}\left( \hat{\sigma}^2_K \right) = \sigma^4/(5n)$, which implies the efficiency $e\left( \hat{\sigma}^2_K \right) = 10$.
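Since this estimator needs intraday observations to build the bridge, a sketch assuming a matrix of intraday log-prices (one row per day, evenly spaced during the open) could be:

```python
import numpy as np

def kunitomo_variance(intraday_log_prices, T=1.0/252.0):
    """Kunitomo estimator: range of the daily Brownian bridge of log-prices.

    intraday_log_prices : array of shape (n_days, M) with ln S_t sampled during each day.
    """
    X = intraday_log_prices - intraday_log_prices[:, [0]]      # normalize X_0 = 0
    M = X.shape[1]
    s = np.linspace(0.0, 1.0, M)                                # intraday time fraction t/T
    W = X - s * X[:, [-1]]                                      # bridge W_t = X_t - (t/T) X_T
    D = W.max(axis=1) - W.min(axis=1)                           # daily bridge range
    n = X.shape[0]
    return np.sum(6.0 * D ** 2 / np.pi ** 2) / (n * T)
```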
The Rogers-Satchell estimator

Another way to eliminate the drift effect was proposed by Rogers and Satchell. They consider the following property of the Brownian motion:

$$\mathbb{E}\left[ u\,(u - c) + d\,(d - c) \right] = \sigma^2 T$$

This expectation value does not depend on the drift of the Brownian motion, hence it provides a drift-independent estimator, which can be defined as:

$$\hat{\sigma}^2_{RS} = \frac{1}{nT} \sum_{i=1}^{n} \left[ u_{t_i} (u_{t_i} - c_{t_i}) + d_{t_i} (d_{t_i} - c_{t_i}) \right]$$

The variance of this estimator is given by $\mathrm{var}\left( \hat{\sigma}^2_{RS} \right) = 0.331\, \sigma^4/n$, which gives an efficiency $e\left( \hat{\sigma}^2_{RS} \right) = 6$.

Like the other techniques based on the high-low range, this estimator underestimates the volatility, owing to the fact that the maximum of a discretized Brownian motion is smaller than the true one. Rogers and Satchell have also proposed a correction scheme which can be generalized to the other techniques. Let $M$ be the number of quoted prices, so that $h = T/M$ is the discretization step; the corrected estimator, which takes the finite-step error into account, is given by the root of the following equation:

$$\hat{\sigma}^2_h = 2 b h\, \hat{\sigma}^2_h + 2\, (u - d)\, a \sqrt{h}\, \hat{\sigma}_h + \hat{\sigma}^2_{RS}$$

where $a = \sqrt{2}\left( 1/4 - (\sqrt{2}-1)/6 \right)\sqrt{\pi}$ and $b = (1 + 3\pi/4)/12$.
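A hedged Python sketch of the basic (uncorrected) Rogers-Satchell estimator, with our own names and annualization, could be:

```python
import numpy as np

def rogers_satchell_variance(open_, high, low, close, T=1.0/252.0):
    """Rogers-Satchell drift-independent estimator of the annualized variance."""
    u = np.log(high) - np.log(open_)
    d = np.log(low) - np.log(open_)
    c = np.log(close) - np.log(open_)
    n = len(c)
    return np.sum(u * (u - c) + d * (d - c)) / (n * T)
```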


2.2.4 How to eliminate both drift and opening effects?

A common way to eliminate both the effect of the drift and that of the opening jump is to combine the various available volatility estimators. The general scheme is to form a linear combination of the opening estimator $\hat{\sigma}^2_O$ and the close estimator $\hat{\sigma}^2_C$ or a high-low estimator $\hat{\sigma}^2_{HL}$. The coefficients of this combination are determined by minimizing the variance of the resulting estimator. Given the fraction of the closing interval $f$, we can improve all the high-low estimators discussed above by introducing the combination:

$$\hat{\sigma}^2 = \alpha\, \frac{\hat{\sigma}^2_O}{f} + (1 - \alpha)\, \frac{\hat{\sigma}^2_{HL}}{1 - f}$$

Here, the trivial choice is $\alpha = f$ and the estimator becomes independent of the opening jump. However, the optimal value of the coefficient is $\alpha = 0.17$ for the Parkinson and Kunitomo estimators, whereas it is $\alpha = 0.12$ for the Garman-Klass estimator (Garman and Klass, 1980). This technique can eliminate the effect of the opening jump for all the estimators, but only the Kunitomo estimator can avoid both effects.

Applying the same idea, Yang and Zhang (2000) have proposed another combination which, like the Kunitomo estimator, also eliminates both effects. They choose the following combination:

$$\hat{\sigma}^2_{YZ} = \alpha\, \frac{\hat{\sigma}^2_O}{f} + \frac{1}{1 - f} \left( \beta\, \hat{\sigma}^2_C + (1 - \beta)\, \hat{\sigma}^2_{HL} \right)$$

In the work of Yang and Zhang, $\hat{\sigma}^2_{RS}$ is used as the high-low estimator because it is drift independent. The coefficient $\alpha$ is chosen as $\alpha = f$ and $\beta$ is obtained by minimizing the variance of the estimator. The minimization procedure gives the optimal value of the parameter $\beta$:

$$\beta^{o} = \frac{\theta - 1}{\theta + \frac{n+1}{n-1}}$$

where $\theta = \mathbb{E}\left[ \left( u\,(u-c) + d\,(d-c) \right)^2 \right] / \left( \sigma^4 (1-f)^2 \right)$. As the numerator is proportional to $(1-f)^2$, $\theta$ is independent of $f$. Indeed, the value of $\theta$ varies little (from 1.331 to 1.5) when the drift changes. In practice, the value of $\theta$ is set to 1.34.
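A sketch of the Yang-Zhang combination, in the simplified reading where the opening-jump, open-to-close and Rogers-Satchell components are combined with weights 1, $\beta$ and $1-\beta$ (our own names and annualization, not the thesis code), could be:

```python
import numpy as np

def yang_zhang_variance(open_, high, low, close, T=1.0/252.0):
    """Yang-Zhang estimator combining overnight, open-to-close and Rogers-Satchell terms."""
    o = np.log(open_[1:]) - np.log(close[:-1])        # opening jumps
    u = np.log(high[1:]) - np.log(open_[1:])
    d = np.log(low[1:]) - np.log(open_[1:])
    c = np.log(close[1:]) - np.log(open_[1:])
    n = len(c)
    var_o = np.sum((o - o.mean()) ** 2) / ((n - 1) * T)
    var_c = np.sum((c - c.mean()) ** 2) / ((n - 1) * T)
    var_rs = np.sum(u * (u - c) + d * (d - c)) / (n * T)
    theta = 1.34
    beta = (theta - 1.0) / (theta + (n + 1.0) / (n - 1.0))
    return var_o + beta * var_c + (1.0 - beta) * var_rs
```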
2.2.5 Numerical simulations

Simulation with constant volatility

We test the various volatility estimators on a simulation of a geometric Brownian motion with constant annualized drift $\mu = 30\%$ and constant annualized volatility $\sigma = 15\%$. The simulation is based on $N = 1000$ trading days with $M = 50$ or $500$ intra-day observations, in order to illustrate the effect of the price discretization on the family of high-low estimators.

Effect of the discretization

We first test the effect of the discretization on the various estimators. Here, we take $M = 50$ or $500$ intraday observations with $\mu = 0$ and $f = 0$. In Figure 2.2, we present the simulation results for $M = 50$ price quotations per trading day. All the high-low estimators are weakly biased due to the discretization effect: they all underestimate the volatility, as the range of the estimator is smaller than the true range of the Brownian motion. We remark that the close-to-close estimator is unbiased but its variance is too large. The correction scheme proposed by Rogers and Satchell can eliminate the discretization effect. When the number of observations is larger, the discretization effect is negligible and all estimators are unbiased (see Figure 2.3).
[Figure 2.2: Volatility estimators without drift and opening effects (M = 50). Curves: simulated σ, CC, OC, P, K, GK, RS, RSh, YZ.]
Effect of the non-zero drift

We now consider the case with a non-zero annualized drift $\mu = 30\%$. Here, we take $M = 500$ intraday observations. In Figure 2.4, we observe that the Parkinson estimator and the Garman-Klass estimator depend strongly on the drift of the Brownian motion, whereas the Kunitomo and Rogers-Satchell estimators do not depend on the drift.

Effect of the opening jump

For the effect of the opening jump, we simulate data with $f = 0.3$. In Figure 2.5, we take $M = 500$ intraday observations with zero drift $\mu = 0$. We observe that, in the presence of the opening jump, all high-low estimators underestimate the volatility, except for the YZ estimator. By combining the open volatility estimator $\hat{\sigma}^2_O$ with the other estimators, the effect of the opening jump can be completely eliminated (see Figure 2.6).
[Figure 2.3: Volatility estimators without drift and opening effects (M = 500). Curves: simulated σ, CC, OC, P, K, GK, RS, RSh, YZ.]
[Figure 2.4: Volatility estimators with μ = 30% and without opening effect (M = 500). Curves: simulated σ, CC, OC, P, K, GK, RS, RSh, YZ.]
[Figure 2.5: Volatility estimators with opening effect f = 0.3 and without drift (M = 500). Curves: simulated σ, CC, OC, P, K, GK, RS, RSh, YZ.]
Figure 2.6: Volatility estimators with correction of the opening jump (f = 0.3)
Simulation with stochastic volatility

We now consider a simulation with stochastic volatility, described by the following model:

$$\begin{cases} dS_t = \mu_t S_t \, dt + \sigma_t S_t \, dB_t \\ d\sigma_t^2 = \nu\, \sigma_t^2 \, dB'_t \end{cases} \qquad (2.5)$$

in which $B'_t$ is a Brownian motion independent of that of the asset process.

We first estimate the volatility with all the proposed estimators and then verify the quality of these estimators via a backtest using the vol-target strategy². For the simulation of the volatility, we take the same parameters as above with $f = 0$, $\mu = 0$, $N = 5000$, $M = 500$, $\nu = 0.01$ and $\sigma_0 = 0.4$. In Figure 2.7, we present the results corresponding to the different estimators. We remark that the group of high-low estimators gives a better result for the volatility estimation.
[Figure 2.7: Volatility estimators on the stochastic volatility simulation. Curves: simulated σ, CC, OC, P, K, GK, RS, RSh, YZ.]
We can estimate the error committed by each estimator with the following formula:

$$\epsilon = \sum_{t=1}^{N} \left( \hat{\sigma}_t - \sigma_t \right)^2$$

The errors obtained for the various estimators are summarized in Table 2.1 below.
We now apply the volatility estimates to the vol-target strategies². The result of this test is presented in Figure 2.8.

²A detailed description of the vol-target strategy is given in the Backtest section below.
Table 2.1: Estimation error $\epsilon = \sum_{t=1}^{N} (\hat{\sigma}_t - \sigma_t)^2$ for the various estimators

Estimator    CC       P        K        GK       RS       YZ
Error        0.135    0.072    0.063    0.08     0.076    0.065
In order to control the quality of the vol-target strategy, we compute the volatility of the vol-target strategy obtained with each estimator. Note that the volatility of the vol-target strategies is computed with the close-to-close estimator, using the same averaging window of 3 months (or 65 trading days). The result is reported in Figure 2.9. As shown in the figure, all estimators give more or less the same results. If we compute the error committed by these estimators, we obtain $\epsilon_{CC} = 0.9491$, $\epsilon_P = 1.0331$, $\epsilon_K = 0.9491$, $\epsilon_{GK} = 1.2344$, $\epsilon_{RS} = 1.2703$ and $\epsilon_{YZ} = 1.1383$. This result may come from the fact that we have used the close-to-close estimator to calculate the volatility of all the vol-target strategies.
[Figure 2.8: Test of the vol-target strategy on the stochastic volatility simulation (wealth paths). Curves: benchmark, CC, OC, P, GK, RS, YZ.]
Hence, we consider another check of the estimation quality. We compute the realized returns of the vol-target strategies:

$$R_V(t_i) = \ln V_{t_i} - \ln V_{t_{i-1}}$$

where $V_{t_i}$ is the wealth of the vol-target portfolio. We expect this quantity to follow a Gaussian distribution with volatility $\sigma^* = 15\%$. Figure 2.10 shows the probability density function (pdf) of the realized returns corresponding to all the considered estimators.
[Figure 2.9: Realized volatility of the vol-target strategies on the stochastic volatility simulation. Curves: CC, OC, P, K, GK, RS, YZ.]
In order to obtain a more visible result, we compute the difference between the cumulative distribution function (cdf) of each estimator and the expected cdf (see Figure 2.11). Both results confirm that the Parkinson and Kunitomo estimators improve the quality of the volatility estimation.
2.2.6 Backtest

Volatility estimation of the S&P 500 index

We now employ the estimators discussed above on the S&P 500 index. Here, we do not have tick-by-tick intraday data, hence the Kunitomo estimator and the Rogers-Satchell correction cannot be applied.

We remark that the effect of the drift is almost negligible, which is confirmed by the Parkinson and Garman-Klass estimators. The spontaneous opening jump is estimated simply by:

$$f_t = \left( 1 + \left( \frac{\hat{\sigma}_C}{\hat{\sigma}_O} \right)^2 \right)^{-1}$$

We then employ the exponential-averaging technique to obtain a filtered version of this quantity. We obtain an average value of the closing interval over the considered data of $\bar{f} = 0.015$ for the S&P 500 and $\bar{f} = 0.21$ for the BBVA SQ Equity. In the following, we use different estimators to extract the signal $f_t$. The trivial one uses $f_t$ itself as the prediction of the opening jump, denoted $\hat{f}_t$; we then construct the usual filters, namely the moving average $\hat{f}_{ma}$, the exponential moving average $\hat{f}_{exp}$ and the cumulated average $\hat{f}_c$.
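A small sketch of this step, estimating $f_t$ from the CO and OC variance estimates and smoothing it exponentially, could look as follows. The function name and the smoothing factor 0.94 are our own assumptions:

```python
import numpy as np

def opening_fraction(var_co, var_oc, lam=0.94):
    """Estimate the closing-period fraction f_t and smooth it exponentially.

    var_co, var_oc : arrays of close-to-open and open-to-close variance estimates.
    lam            : exponential smoothing factor (our choice, not from the thesis).
    """
    f_raw = 1.0 / (1.0 + var_oc / var_co)       # f_t = (1 + sigma_C^2 / sigma_O^2)^(-1)
    f_exp = np.empty_like(f_raw)
    f_exp[0] = f_raw[0]
    for i in range(1, len(f_raw)):
        f_exp[i] = lam * f_exp[i - 1] + (1.0 - lam) * f_raw[i]
    return f_raw, f_exp
```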
[Figure 2.10: Comparison between the different probability density functions of the realized returns $R_V$. Curves: expected pdf, CC, OC, P, K, GK, RS, YZ.]
[Figure 2.11: Comparison between the different cumulative distribution functions of $R_V$. Curves: CC, OC, P, K, GK, RS, YZ.]
[Figure 2.12: Volatility estimators on the S&P 500 index. Curves: CO, CC, OC, P, GK, RS, YZ.]
[Figure 2.13: Volatility estimators on BHI UN Equity. Curves: CO, CC, OC, P, GK, RS, YZ.]
[Figure 2.14: Estimation of the closing interval for the S&P 500 index. Curves: realized closing ratio, moving average, exponential average, cumulated average, average.]
[Figure 2.15: Estimation of the closing interval for BHI UN Equity. Curves: realized closing ratio, moving average, exponential average, cumulated average, average.]
In Figure 2.15, we show the results corresponding to the different filters of $f$ for the BHI UN Equity data. Figure 2.13 shows that the family of high-low estimators gives a better result than the classical close-to-close estimator. In order to check the quality of these estimators for the prediction of the volatility, we compare the values of the likelihood function corresponding to each estimator. Assuming that the observed signal follows a Gaussian distribution, the log-likelihood function is defined as:

$$l(\hat{\sigma}) = -\frac{n}{2} \ln 2\pi - \frac{1}{2} \sum_{i=1}^{n} \ln \hat{\sigma}_i^2 - \frac{1}{2} \sum_{i=1}^{n} \left( \frac{R_{i+1}}{\hat{\sigma}_i} \right)^2$$

where $R$ is the future realized return. In Figure 2.17, we present the values of the likelihood function for the different estimators. This function reaches its maximal value for the Rogers-Satchell estimator.
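A minimal Python sketch of this Gaussian log-likelihood check (our own function, with the forecast horizon convention left to the user) could be:

```python
import numpy as np

def gaussian_log_likelihood(returns, sigma_forecast):
    """Log-likelihood of future returns under a Gaussian model with forecast volatility.

    returns        : realized returns R_{i+1}
    sigma_forecast : volatility forecasts sigma_i made one step earlier (same length)
    """
    var = sigma_forecast ** 2
    n = len(returns)
    return (-0.5 * n * np.log(2.0 * np.pi)
            - 0.5 * np.sum(np.log(var))
            - 0.5 * np.sum(returns ** 2 / var))
```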
[Figure 2.16: Likelihood function for the various estimators on the S&P 500 (CC, OC, P, GK, RS, YZ; values of order 10^4).]
Backtest of the vol-target strategy

We now backtest the efficiency of the various volatility estimators with a vol-target strategy on the S&P 500 index and on an individual stock. Within the vol-target strategy, the exposure to the risky asset is determined by the following expression:

$$e_t = \frac{\sigma^*}{\hat{\sigma}_t}$$

where $\sigma^*$ is the target volatility of the strategy and $\hat{\sigma}_t$ is the prediction of the volatility given by the estimators above. In the backtest, we take the annualized target volatility $\sigma^* = 15\%$ with historical data from 01/01/2001 to 31/12/2011. We present the results for two cases:
[Figure 2.17: Likelihood function for the various estimators on BHI UN Equity (CC, OC, P, GK, RS, YZ; values of order 10^4).]
- Backtest on the S&P 500 index with a moving average of 1 month ($n = 21$) of historical data. We remark that in this case the volatility of the index is small, so the error on the volatility estimation has less impact. However, the high-low estimators suffer from the discretization effect and therefore underestimate the volatility. For the index this effect is more important, so the close-to-close estimator gives the best performance.

- Backtest on a single asset with a moving average of 1 month ($n = 21$) of historical data. For an individual asset such as the BBVA SQ Equity, the volatility is higher, hence the error due to the efficiency of the volatility estimators matters more. The high-low estimators now give better results than the classical one.
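To make the strategy explicit, here is a minimal sketch of a vol-target backtest built on the exposure formula above; the daily rebalancing, the absence of transaction costs and financing, and the leverage cap are our own simplifying assumptions:

```python
import numpy as np

def vol_target_backtest(asset_returns, sigma_forecast, sigma_target=0.15, max_exposure=2.0):
    """Wealth path of a vol-target strategy rebalanced daily.

    asset_returns  : daily returns of the risky asset
    sigma_forecast : annualized volatility forecast available before each day
    """
    exposure = np.clip(sigma_target / sigma_forecast, 0.0, max_exposure)  # e_t = sigma* / sigma_t
    wealth = np.cumprod(1.0 + exposure * asset_returns)
    return wealth, exposure
```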
In order to illustrate the efficiency of the range-based estimators, we perform a ranking between the high-low estimators and the benchmark close-to-close estimator. We apply the vol-target strategy with the close-to-close estimator $\hat{\sigma}^2_{CC}$ and with a high-low estimator $\hat{\sigma}^2_{HL}$. We then compare the Sharpe ratios obtained with these two estimators and compute the number of times the high-low estimator gives the better performance over the ensemble of stocks. The results for the S&P 500 index and its first 100 constituents are summarized in Tables 2.2 and 2.3.
[Figure 2.18: Backtest of the vol-target strategy on the S&P 500 index. Curves: S&P 500, CC, OC, P, GK, RS, YZ.]
[Figure 2.19: Backtest of the vol-target strategy on BHI UN Equity. Curves: benchmark, CC, OC, P, GK, RS, YZ.]
Table 2.2: Performance of $\hat{\sigma}^2_{HL}$ versus $\hat{\sigma}^2_{CC}$ for different averaging windows

Window      P        GK       RS       YZ
6 months    56.2%    52.8%    52.8%    57.3%
3 months    52.8%    49.4%    51.7%    53.9%
2 months    60.7%    60.7%    60.7%    56.2%
1 month     65.2%    64.0%    64.0%    64.0%
Table 2.3: Performance of $\hat{\sigma}^2_{HL}$ versus $\hat{\sigma}^2_{CC}$ for different filters of $f$

Filter            P        GK       RS       YZ
$\hat{f}_c$       65.2%    64.0%    64.0%    64.0%
$\hat{f}_{ma}$    64.0%    61.8%    61.8%    64.0%
$\hat{f}_{exp}$   64.0%    61.8%    60.7%    64.0%
$\hat{f}_t$       64.0%    61.8%    60.7%    64.0%
2.3 Estimation of realized volatility

The common way to estimate the realized volatility is to estimate the expectation value of the variance over an observation window and then compute the corresponding volatility. However, in doing so we face a dilemma: taking a long historical window helps to decrease the estimation error, as discussed in the last paragraph, whereas taking a short history gives an estimate that is closer to the present volatility.

In order to overcome this dilemma, we need an idea of the dynamics of the variance $\sigma^2_t$ that we would like to measure. Combining this knowledge of the dynamics of $\sigma^2_t$ with the error committed over a long historical window, we can find an optimal window for the volatility estimator. We assume that the variance follows the simplified dynamics already used in the last numerical simulation:

$$\begin{cases} dS_t = \mu_t S_t \, dt + \sigma_t S_t \, dB_t \\ d\sigma_t^2 = \nu\, \sigma_t^2 \, dB'_t \end{cases}$$

in which $B'_t$ is a Brownian motion independent of that of the asset process.
2.3.1 Moving-average estimator

In this section, we show how the optimal window of the moving-average estimator is obtained via a simple example. Let us consider the canonical estimator:

$$\hat{\sigma}^2 = \frac{1}{nT} \sum_{i=1}^{n} R_{t_i}^2$$

Here, the time increment is chosen to be constant, $t_i - t_{i-1} = T$; the variance of this estimator at time $t_n$ is then:

$$\mathrm{var}\left( \hat{\sigma}^2 \right) \approx \frac{2 \sigma_{t_n}^4 T}{t_n - t_0} = \frac{2 \sigma_{t_n}^4}{n}$$

On the other hand, $\sigma^2_t$ is now itself a stochastic process, hence its variance conditional on $\sigma^2_{t_n}$ gives us the error due to the use of historical observations. We rewrite:

$$\frac{1}{t_n - t_0} \int_{t_0}^{t_n} \sigma_t^2 \, dt = \sigma_{t_n}^2 - \frac{1}{t_n - t_0} \int_{t_0}^{t_n} (t - t_0)\, \nu\, \sigma_t^2 \, dB'_t$$

so the error due to the stochastic volatility is given by:

$$\mathrm{var}\left( \frac{1}{t_n - t_0} \int_{t_0}^{t_n} \sigma_t^2 \, dt \,\middle|\, \sigma_{t_n}^2 \right) \approx \frac{t_n - t_0}{3}\, \sigma_{t_n}^4 \nu^2 = \frac{nT\, \sigma_{t_n}^4 \nu^2}{3}$$

The total error of the canonical estimator is simply the sum of these two errors, because the two Brownian motions are assumed to be independent. We define the total estimation error function as follows:

$$e\left( \hat{\sigma}^2 \right) = \frac{2 \sigma_{t_n}^4}{n} + \frac{nT\, \sigma_{t_n}^4 \nu^2}{3}$$

In order to obtain the optimal window for the volatility estimation, we minimize the error function $e\left( \hat{\sigma}^2 \right)$ with respect to $nT$, which leads to the following equation:

$$\frac{\sigma_{t_n}^4 \nu^2}{3} - \frac{2 \sigma_{t_n}^4}{n^2 T} = 0$$

This equation has a very simple solution, $nT = \sqrt{6T}/\nu$, and the optimal error is then $e\left( \hat{\sigma}^2_{opt} \right) \approx 2 \nu \sqrt{2T/3}\; \sigma_{t_n}^4$. The main difficulty of this estimator is the calibration of the parameter $\nu$, which is not trivial because $\sigma^2_t$ is an unobservable process. Different techniques can be considered, such as the maximum likelihood approach, which will be discussed later.
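As a quick numerical illustration of this trade-off, assuming the vol-of-vol parameter $\nu$ has somehow been calibrated and using the optimal-window formula reconstructed above:

```python
import numpy as np

def optimal_window_days(nu, T=1.0/252.0):
    """Optimal moving-average window n*T = sqrt(6T)/nu, returned in trading days."""
    horizon_years = np.sqrt(6.0 * T) / nu
    return horizon_years / T

print(optimal_window_days(nu=0.5))   # e.g. nu = 0.5 gives a window of about 78 days
```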
2.3.2 IGARCH estimator

We now discuss another approach for estimating the realized volatility, based on the IGARCH model. The detailed theoretical derivation of the method is given in Drost et al. (1993, 1999). It consists of a volatility estimator of the form:

$$\hat{\sigma}^2_t = \lambda\, \hat{\sigma}^2_{t-T} + \frac{1 - \lambda}{T}\, R_t^2$$

where $T$ is a constant estimation increment. Integrating the recurrence relation above, we obtain the IGARCH estimator of the variance as a function of the returns observed in the past:

$$\hat{\sigma}^2_t = \frac{1 - \lambda}{T} \sum_{i=0}^{n-1} \lambda^{i}\, R^2_{t-iT} + \lambda^n\, \hat{\sigma}^2_{t-nT} \qquad (2.6)$$

We remark that the contribution of the last term tends to 0 when $n$ tends to infinity. This estimator again has the form of a weighted average, so an approach similar to the one used for the canonical estimator is applicable. Assuming that the volatility follows the lognormal dynamics described by Equation (2.5), the optimal value of $\lambda$ is given by:

$$\lambda^* = \frac{4 + \nu^2 T - \sqrt{\nu^4 T^2 + 8 \nu^2 T}}{4} \qquad (2.7)$$

We encounter here the same question as in the canonical case, namely how to calibrate the parameter $\nu$ of the lognormal dynamics. In practice, we proceed the other way around: we first seek the optimal value $\lambda^*$ of the IGARCH estimator and then use the inverse relation of Equation (2.7) to determine the value of $\nu$:

$$\nu = \sqrt{\frac{2}{T \lambda^*}}\, \left( 1 - \lambda^* \right)$$
Remark 1 Finally, as stressed at the beginning of this discussion, we would like to point out that the IGARCH estimator can be considered as an exponentially weighted average. We start with an IGARCH estimator with constant time increment. The expectation value of this estimator is:

$$\mathbb{E}\left[ \hat{\sigma}^2_t \right] = \mathbb{E}\left[ \frac{1-\lambda}{T} \sum_{i=0}^{+\infty} \lambda^{i} R^2_{t-iT} \right]
= \frac{1-\lambda}{T} \sum_{i=0}^{+\infty} \lambda^{i} \int_{t-iT}^{t-iT+T} \sigma_u^2 \, du
= \frac{1}{\sum_{i=0}^{+\infty} T \lambda^{i}} \sum_{i=0}^{+\infty} \lambda^{i} \int_{t-iT}^{t-iT+T} \sigma_u^2 \, du
= \frac{1}{\sum_{i=0}^{+\infty} T e^{-\gamma iT}} \sum_{i=0}^{+\infty} e^{-\gamma iT} \int_{t-iT}^{t-iT+T} \sigma_u^2 \, du$$

with $\gamma = -\ln \lambda / T$. In this form, we conclude that the IGARCH estimator is a weighted average of the variance $\sigma^2_t$ with an exponential weight distribution. The annualized estimator of the volatility can be written as:

$$\mathbb{E}\left[ \hat{\sigma}^2_t \right] = \frac{\sum_{i=0}^{+\infty} e^{-\gamma iT} \int_{t-iT}^{t-iT+T} \sigma_u^2 \, du}{\sum_{i=0}^{+\infty} T e^{-\gamma iT}}$$

This expression admits a continuous limit when $T \to 0$.
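A hedged sketch of the IGARCH (exponentially weighted) variance estimator, together with a simple grid search of $\lambda$ by maximum likelihood (the initialization and the grid are our own choices), could be:

```python
import numpy as np

def igarch_variance(returns, lam, T=1.0/252.0, var0=None):
    """IGARCH / exponentially weighted estimator of the annualized variance."""
    var = np.empty(len(returns))
    var[0] = var0 if var0 is not None else returns[:20].var() / T   # crude initialization
    for i in range(1, len(returns)):
        var[i] = lam * var[i - 1] + (1.0 - lam) * returns[i] ** 2 / T
    return var

def best_lambda(returns, grid=np.linspace(0.80, 0.99, 20), T=1.0/252.0):
    """Pick the decay lambda maximizing the Gaussian log-likelihood of next-day returns."""
    best, best_ll = None, -np.inf
    for lam in grid:
        var = igarch_variance(returns, lam, T)[:-1] * T              # daily variance forecast
        ll = -0.5 * np.sum(np.log(2 * np.pi * var) + returns[1:] ** 2 / var)
        if ll > best_ll:
            best, best_ll = lam, ll
    return best
```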
2.3.3 Extension to range-based estimators

The estimation of the optimal window in the last discussion can also be generalized to the case of range-based estimators. The main idea is to find the trade-off between the estimation error (the variance of the estimator) and the dynamics of the volatility described by the model (2.5). The equation that determines the total error of the estimator is:

$$e_{tot}\left( \hat{\sigma}^2 \right) = \mathrm{var}\left( \hat{\sigma}^2 \right) + \frac{nT}{3}\, \sigma_{t_n}^4 \nu^2$$

Here, we remind that the first term in this expression is the estimation error coming from the discrete sum, whereas the second term is the error due to the stochastic volatility. The first term is already given by the study of the various estimators in the last section; the second term depends on the choice of the volatility dynamics. Using the notation of the estimator efficiency, we rewrite the above expression as:

$$e_{tot}\left( \hat{\sigma}^2 \right) = \frac{1}{e\left( \hat{\sigma}^2 \right)}\, \frac{2 \sigma_{t_n}^4}{n} + \frac{nT}{3}\, \sigma_{t_n}^4 \nu^2$$

The minimization procedure for the total error is exactly the same as in the last example on the canonical estimator, and we obtain the following optimal averaging window:

$$nT = \sqrt{\frac{6T}{e\left( \hat{\sigma}^2 \right) \nu^2}} \qquad (2.8)$$

The IGARCH estimator can also be applied to the various types of high-low estimators; the extension consists of performing an exponential moving average instead of the simple average. The parameter $\lambda$ of the exponential moving average is again determined by the maximum likelihood method, as shown in the discussion below.
2.3.4 Calibration procedure for the estimators of realized volatility

As discussed above, the estimators of realized volatility depend on the choice of the underlying dynamics of the volatility. In order to obtain the best estimation of the realized volatility, we must estimate the parameter which characterizes this dynamics. Two possible approaches to obtain the optimal value of these estimators are:

- the least-squares problem, which consists in minimizing the following objective function:

$$\sum_{i=1}^{n} \left( R^2_{t_i + T} - T\, \hat{\sigma}^2_{t_i} \right)^2$$

- or the maximum likelihood problem, which consists in maximizing the log-likelihood objective function:

$$-\frac{n}{2} \ln 2\pi - \sum_{i=0}^{n} \frac{1}{2} \ln \left( T \hat{\sigma}^2_{t_i} \right) - \sum_{i=0}^{n} \frac{R^2_{t_i + T}}{2 T \hat{\sigma}^2_{t_i}}$$
We remark here that the moving-average estimator depends only on the averaging window, whereas the IGARCH estimator depends only on the parameter $\lambda$. In general, there is no way to compare these two estimators if we do not use a specific dynamics. With such a dynamics, the optimal values of both parameters are obtained from the optimal value of $\nu$, which offers a direct comparison between the quality of these two estimators.

Example of realized volatility

We illustrate here how the realized volatility is computed by the two methods discussed above. In order to illustrate how the optimal value of the averaging window $nT$ or of $\lambda^*$ is calibrated, we plot the likelihood functions of these two estimators for one value of the volatility at a given date. In Figure 2.20, we present the logarithm of the likelihood function for different values of $\nu$. The maximal value of the function $l(\nu)$ gives us the optimal value $\nu^*$, which is used to evaluate the volatility for the two methods. We remark that the IGARCH estimator is better suited to locating the global maximum because its log-likelihood is a concave function. For the moving-average method, the log-likelihood function is not smooth and presents a complicated structure with local maxima, which is less convenient for the optimization procedure.
[Figure 2.20: Comparison between the IGARCH estimator and the CC estimator: log-likelihood $l(\nu)$ as a function of $\nu$. Curves: CC optimal, IGARCH.]
We now test the implementation of the IGARCH estimator for the various high-low estimators. As we have shown that the IGARCH estimator is equivalent to an exponential moving average, the implementation for high-low estimators can be set up in the same way as for the close-to-close estimator. In order to determine the optimal parameter $\lambda^*$, we perform an optimization of the log-likelihood function. In Figure 2.21, we present the comparison of the log-likelihood functions of the different estimators as a function of the parameter $\lambda$. The optimal parameter $\lambda^*$ of each estimator corresponds to the maximum of the log-likelihood function.
[Figure 2.21: Likelihood function of the high-low estimators versus the filter parameter $\lambda$. Curves: CC, OC, P, GK, RS, YZ.]
In order to get a clear idea of the moving-average window size corresponding to the optimal parameter $\lambda^*$, we use formula (2.7) to perform the conversion. The result is reported in Figure 2.22 below.
Backtest on the vol-target strategy

We take historical data of the S&P 500 index over the period from 01/2001 to 12/2011, and the averaging window of the close-to-close estimator is chosen as $n = 25$. In Figure 2.23, we show the different estimations of the realized volatility.

In order to test the efficiency of these realized estimators (moving average and IGARCH), we first evaluate the likelihood function for the close-to-close estimator and the realized estimators, and then apply these estimators to the vol-target strategy as in the last section. In Figure 2.25, we present the value of the likelihood function over the period from 01/2001 to 12/2010 for three estimators: CC, CC optimal (moving average) and IGARCH. The estimator corresponding to the highest value of the likelihood function is the one that gives the best prediction of the volatility.
[Figure 2.22: Likelihood function of the high-low estimators versus the effective moving window $n$. Curves: CC, OC, P, GK, RS, YZ.]
[Figure 2.23: IGARCH estimator versus moving-average estimator for close-to-close prices. Curves: CC, CC optimal, IGARCH.]
[Figure 2.24: Comparison between the different IGARCH estimators for high-low prices. Curves: CC, CO, P, GK, RS, YZ.]
[Figure 2.25: Daily estimation of the likelihood function for the various close-to-close estimators. Curves: CC, CC optimal, CC IGARCH.]
[Figure 2.26: Daily estimation of the likelihood function for the various high-low estimators. Curves: CC, OC, P, GK, RS, YZ.]
In Figure 2.27, the backtest of the vol-target strategy is performed for the three considered estimators. The estimators with a dynamical choice of the averaging parameter always give a better result than a simple close-to-close estimator with a fixed averaging window $n = 25$. We next backtest the IGARCH estimator applied to the high-low price data; the comparison with the IGARCH estimator applied to close-to-close data is shown in Figure 2.28. We observe that the IGARCH estimator for close-to-close prices is one of the estimators that produce the best backtest.
2.4 High-frequency volatility estimators

We have discussed in the previous sections how to measure the daily volatility based on the range of the observed prices. If more information is available in the trading data, such as all the real-time quotations, can one estimate the volatility more accurately? As the trading frequency increases, we expect the precision of the estimator to improve as well. However, when the trading frequency reaches a certain limit, a new phenomenon coming from the non-equilibrium of the market emerges and spoils the precision. This limit defines the optimal frequency for the classical estimator; in the literature, it is more or less agreed to be at the frequency of one trade every 5 minutes. This phenomenon is called microstructure noise and is characterized by the bid-ask spread and transaction effects. In this section, we summarize and test some recent proposals which attempt to eliminate the microstructure noise.
[Figure 2.27: Backtest for the close-to-close estimator and the realized estimators. Curves: S&P 500, CC, CC optimal, CC IGARCH.]
[Figure 2.28: Backtest for the IGARCH high-low estimators compared to the IGARCH close-to-close estimator. Curves: S&P 500, CC, OC, P, GK, RS, YZ.]
2.4.1 Microstructure effect

It has been demonstrated in the financial literature that the realized-return estimator is not robust when the sampling frequency is too high. Two possible explanations of this effect are the following. From a probabilistic point of view, this phenomenon comes from the fact that the cumulated return (or the logarithm of the price) is not a semimartingale, as we assumed in the last section; however, this emerges only on short time scales, when the trading frequency is high enough. From a financial point of view, this effect is explained by the existence of so-called market microstructure noise, which comes from the existence of the bid-ask spread. We now discuss the simplest model, which includes the microstructure noise as a noise independent of the underlying Brownian motion. We assume that the true cumulated return is an unobservable process and follows a Brownian motion:

$$dX_t = \left( \mu - \frac{\sigma_t^2}{2} \right) dt + \sigma_t \, dB_t$$

The observed signal $Y_t$ is the cumulated return perturbed by the microstructure noise $\epsilon_t$:

$$Y_t = X_t + \epsilon_t$$

For the sake of simplicity, we use the following assumptions:

(i) $\epsilon_{t_i}$ is i.i.d. with $\mathbb{E}[\epsilon_{t_i}] = 0$ and $\mathbb{E}\left[ \epsilon_{t_i}^2 \right] = \mathbb{E}\left[ \epsilon^2 \right]$

(ii) $\epsilon_t \perp B_t$

From these assumptions, we see immediately that the volatility estimator based on the historical data $Y_{t_i}$ is biased:

$$\mathrm{var}(Y) = \mathrm{var}(X) + \mathbb{E}\left[ \epsilon^2 \right]$$

The first term $\mathrm{var}(X)$ scales with $t$ (the estimation horizon) while $\mathbb{E}\left[ \epsilon^2 \right]$ is constant, so this estimator can be considered unbiased if the time horizon is large enough ($t \gg \mathbb{E}\left[ \epsilon^2 \right]/\sigma^2$). At high frequency, the second term is not negligible and a better estimator must be able to eliminate this term.
2.4.2 Two time-scale volatility estimator

The use of different time scales to extract the true volatility of the hidden price process (without noise) was proposed independently by Zhang et al. (2005) and Bandi et al. (2004). In this paragraph, we employ the approach of the first reference to define the intra-day volatility estimator. We prefer to discuss here the main idea of this method and its practical implementation rather than all the details of the stochastic calculus concerning the expectation value and the variance of the realized return³.

³Details of the derivation of this technique can be found in Zhang et al. (2005).
Definitions and notations

In order to fix the notations, let us consider a time period $[0, T]$ which is divided into $M - 1$ intervals ($M$ can be understood as the frequency). The quadratic variation of the Brownian motion over this period is denoted:

$$\langle X, X \rangle_T = \int_0^T \sigma_t^2 \, dt$$

For the discretized version of the quadratic variation, we employ the $[\cdot, \cdot]$ notation:

$$[X, X]_T = \sum_{t_i, t_{i+1} \in [0,T]} \left( X_{t_{i+1}} - X_{t_i} \right)^2$$

The usual estimator of the realized variance over the interval $[0, T]$ is then given by:

$$[Y, Y]_T = \sum_{t_i, t_{i+1} \in [0,T]} \left( Y_{t_{i+1}} - Y_{t_i} \right)^2$$

We remark that the number of points in the interval $[0, T]$ can be changed; in fact, the expectation value of the quadratic variation should not depend on the distribution of points in this interval. Let us define the ensemble of points in one period as a grid $\mathcal{G}$:

$$\mathcal{G} = \{ t_0, \ldots, t_M \}$$

A subgrid $\mathcal{H}$ is then defined as:

$$\mathcal{H} = \{ t_{k_1}, \ldots, t_{k_m} \}$$

where $(t_{k_j})$ with $j = 1, \ldots, m$ is a subsequence of $(t_i)$ with $i = 1, \ldots, M$. The number of increments is denoted:

$$|\mathcal{H}| = \mathrm{card}(\mathcal{H}) - 1$$

With these notations, the quadratic variation over a subgrid $\mathcal{H}$ reads:

$$[Y, Y]_T^{\mathcal{H}} = \sum_{t_{k_i}, t_{k_{i+1}} \in \mathcal{H}} \left( Y_{t_{k_{i+1}}} - Y_{t_{k_i}} \right)^2$$
The realized volatility estimator over the full grid

We can compute the quadratic variation over the full grid $\mathcal{G}$, i.e. at the highest frequency. As discussed above, it is not surprising that it suffers the most from the effect of the microstructure noise:

$$[Y, Y]_T^{\mathcal{G}} = [X, X]_T^{\mathcal{G}} + 2\, [X, \epsilon]_T^{\mathcal{G}} + [\epsilon, \epsilon]_T^{\mathcal{G}}$$

Under the hypothesis on the microstructure noise, the conditional expectation value of this estimator is equal to:

$$\mathbb{E}\left[ [Y, Y]_T^{\mathcal{G}} \,\middle|\, X \right] = [X, X]_T^{\mathcal{G}} + 2 M\, \mathbb{E}\left[ \epsilon^2 \right]$$

and the variance of the estimator is:

$$\mathrm{var}\left( [Y, Y]_T^{\mathcal{G}} \,\middle|\, X \right) = 4 M\, \mathbb{E}\left[ \epsilon^4 \right] + \left( 8 [X, X]_T^{\mathcal{G}}\, \mathbb{E}\left[ \epsilon^2 \right] - 2\, \mathrm{var}\left( \epsilon^2 \right) \right) + O(M^{-1/2})$$

In the two expressions above, the terms are arranged order by order. In the limit $M \to \infty$, we obtain the usual central limit theorem result:

$$M^{-1/2} \left( [Y, Y]_T^{\mathcal{G}} - 2 M\, \mathbb{E}\left[ \epsilon^2 \right] \right) \xrightarrow{\mathcal{L}} 2 \left( \mathbb{E}\left[ \epsilon^4 \right] \right)^{1/2} \mathcal{N}(0, 1)$$

Hence, as $M$ increases, $[Y, Y]_T^{\mathcal{G}}$ becomes a good estimator of the microstructure noise, and we define:

$$\widehat{\mathbb{E}\left[ \epsilon^2 \right]} = \frac{1}{2M}\, [Y, Y]_T^{\mathcal{G}}$$

The central limit theorem for this estimator states:

$$M^{1/2} \left( \widehat{\mathbb{E}\left[ \epsilon^2 \right]} - \mathbb{E}\left[ \epsilon^2 \right] \right) \xrightarrow{\mathcal{L}} \left( \mathbb{E}\left[ \epsilon^4 \right] \right)^{1/2} \mathcal{N}(0, 1) \quad \text{as } M \to \infty$$
The realized volatility estimator over a subgrid

As mentioned in the last discussion, increasing the frequency spoils the estimation of the volatility due to the presence of the microstructure noise. A naive solution is to reduce the number of points in the grid, i.e. to consider only a subgrid; one can then average over a number of choices of subgrids. Let us consider a subgrid $\mathcal{H}$ with $|\mathcal{H}| = m - 1$; the same result as for the full grid is then obtained by replacing $M$ with $m$:

$$\mathbb{E}\left[ [Y, Y]_T^{\mathcal{H}} \,\middle|\, X \right] = [X, X]_T^{\mathcal{H}} + 2 m\, \mathbb{E}\left[ \epsilon^2 \right]$$

Let us now consider a sequence of subgrids $\mathcal{H}^{(k)}$ with $k = 1, \ldots, K$, which satisfies $\mathcal{G} = \bigcup_{k=1}^{K} \mathcal{H}^{(k)}$ and $\mathcal{H}^{(k)} \cap \mathcal{H}^{(l)} = \emptyset$ for $k \neq l$. By averaging over these $K$ subgrids, we obtain:

$$[Y, Y]_T^{avg} = \frac{1}{K} \sum_{k=1}^{K} [Y, Y]_T^{\mathcal{H}^{(k)}}$$

We define the average length of the subgrids $\bar{m} = (1/K) \sum_{k=1}^{K} m_k$; the final expression is then:

$$\mathbb{E}\left[ [Y, Y]_T^{avg} \,\middle|\, X \right] = [X, X]_T^{avg} + 2 \bar{m}\, \mathbb{E}\left[ \epsilon^2 \right]$$

This estimator of the volatility is still biased, and its precision depends strongly on the choice of the length of the subgrids and of the number of subgrids. In the paper of Zhang et al., the authors demonstrate that there exists an optimal value $K^*$ for which we reach the best performance of the estimator.
Two time-scale estimator

As the full-grid estimator and the subgrid-averaging estimator both contain the same component coming from the microstructure noise, up to a factor, we can combine them to obtain a new estimator in which the microstructure noise is completely eliminated. Let us consider the following estimator:

$$\hat{\sigma}^2_{ts} = \left( 1 - \frac{\bar{m}}{M} \right)^{-1} \left( [Y, Y]_T^{avg} - \frac{\bar{m}}{M}\, [Y, Y]_T^{\mathcal{G}} \right)$$

This estimator is now unbiased, with a precision determined by the choice of $K$ and $\bar{m}$. In the theoretical framework, the optimal value is given as a function of the noise variance and of the fourth moment of the volatility. In practice, we scan over the number of subgrids, of size $\bar{m} \approx M/K$, in order to look for the optimal estimator.
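A hedged Python sketch of the two time-scale estimator applied to a vector of intraday log-prices (our own implementation of the formulas above, not the thesis code) could be:

```python
import numpy as np

def tsrv(log_prices, K=5):
    """Two time-scale realized variance in the spirit of Zhang et al. (2005).

    log_prices : intraday log-prices Y_{t_0}, ..., Y_{t_M} on the full grid
    K          : number of non-overlapping subgrids (subsampling step)
    """
    M = len(log_prices) - 1
    rv_full = np.sum(np.diff(log_prices) ** 2)          # [Y, Y]^G_T
    rv_sub = []
    for k in range(K):                                   # subgrid k, k+K, k+2K, ...
        sub = log_prices[k::K]
        rv_sub.append(np.sum(np.diff(sub) ** 2))
    rv_avg = np.mean(rv_sub)                             # [Y, Y]^avg_T
    m_bar = (M - K + 1) / K                              # average subgrid length
    return (rv_avg - (m_bar / M) * rv_full) / (1.0 - m_bar / M)
```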
2.4.3 Numerical implementation and backtesting

We now backtest the proposed technique on the S&P 500 index with the following choice of subgrids. The full grid is defined by the ensemble of data sampled every minute from the opening to the close of the trading day (9h00 to 17h30). Data are taken from the 1st of February 2011 to the 6th of June 2011. We denote the full grid for each trading-day period:

$$\mathcal{G} = \{ t_0, \ldots, t_M \}$$

and the subgrids are chosen as follows:

$$\mathcal{H}^{(k)} = \{ t_{k-1}, t_{k-1+K}, \ldots, t_{k-1+n_k K} \}$$

where the index $k = 1, \ldots, K$ and $n_k$ is the integer making $t_{k-1+n_k K}$ the last element of $\mathcal{H}^{(k)}$. As we cannot compute exactly the optimal value $K^*$ for each trading period, we employ an iterative scheme which tends to converge to the optimal value. The analytical expression of $K^*$ is given by Zhang et al.:

$$K^* = \left( \frac{12 \left( \mathbb{E}\left[ \epsilon^2 \right] \right)^2}{T\, \eta^2} \right)^{1/3} M^{2/3}$$

where $\eta$ is given by the expression:

$$\eta^2 = \int_0^T \sigma_t^4 \, dt$$

In a first approximation, we consider the case where the intraday volatility is constant; the expression of $\eta$ then simplifies to $\eta^2 = T \sigma^4$. In Figure 2.29, we present the result for the intraday volatility, taking into account only the trading day, for the S&P 500 index under the assumption of constant volatility. The two time-scale estimator reduces the effect of the microstructure noise on the realized volatility computed over the full grid.
[Figure 2.29: Two-time-scale estimator of the intraday volatility. Curves: volatility with the full grid, volatility with a subgrid, volatility with two scales.]
2.5 Conclusion

Vol-target strategies are an efficient way to control the risk when building trading strategies. Hence, a good estimator of the volatility is essential from this perspective. In this paper, we show that we can use the data range to improve the forecast of the market volatility. The use of high and low prices is less important for the index, as it gives more or less the same result as the traditional close-to-close estimator. However, for individual stocks with a higher volatility level, the high-low estimators improve the prediction of the volatility. We consider several backtests on the S&P 500 index and obtain competitive results with respect to the traditional moving-average estimator of the volatility.

Moreover, we consider a simple stochastic volatility model which permits integrating the dynamics of the volatility into the estimator. An optimization scheme via the maximum likelihood algorithm allows us to obtain the optimal averaging window dynamically. We also compare these results for range-based estimators with the well-known IGARCH model. The comparison between the optimal values of the likelihood functions for the various estimators also gives us a ranking of the estimation error.

Finally, we studied the high-frequency volatility estimator, which is a very active topic in financial mathematics. Using the simple model proposed by Zhang et al. (2005), we show that the microstructure noise can be eliminated by the two time-scale estimator.
56
Bibliography
[1] Bandi F. M. and Russell J. R. (2006), Saperating Microstructure Noise from
Volatility Journal of Financial Economics, 79, pp. 655-692.
[2] Drost F. C. and Nijman T. E. (1993), Temporal Aggregation of GARCH
Processes Econometrica, 61, pp. 909-927.
[3] Drost F. C. and Werker J. M. (1999), Closing the GARCH gap: Continuous
time GARCH modeling Journal of Econometrics, 74, pp. 31-57 .
[4] Feller W. (1951), The Asymptotic Distribution of the Range of Sums of Inde-
pendent Random Variables, Annals of Mathematical Statistics, 22, pp. 427-432.
[5] Garman M. B. and Klass M. J. (1980), On the estimation of security price
from historical data, Journal of Business, 53, pp. 67-78.
[6] Kunimoto N. (1992), Improving the Parkinson method of estimating security
price volatilities, Journal of Business, 65, pp. 295-302.
[7] Parkinson M. (1980), The extreme value method for estimating the variance
of the rate of return, Journal of Business, 53, pp. 61-65.
[8] Rogers L. C. G. and Satchell S. E. (1991), Estimating variance form high,
low and closing prices, Annals of Applied Probability 1, pp. 504-512.
[9] Yang D. and Zhang Q. (2000), Drift-Independent Volatility Estimation Based
on High, Low, Open and Close Prices, Journal of Business, 73, pp. 477-491.
[10] Zhang L., Mykland P. A. and Ait-Sahalia Y. (2005), A Tale of Two Time
Scales: Determining Integrated Volatility With Noisy High-Frequency Data
Journal of the American Statistical Association, 100(472), pp. 1394-1411.
57
Chapter 3

Support Vector Machine in Finance

In this chapter, we review the well-known machine learning technique called the support vector machine (SVM). This technique can be employed in different contexts such as classification, regression or density estimation, following Vapnik [1998]. In this chapter, we first give an overview of the method and its various numerical implementations, and then bridge it to financial applications such as stock selection.

Keywords: Machine learning, Statistical learning, Support vector machine, regression, classification, stock selection.
3.1 Introduction

The support vector machine is an important part of Statistical Learning Theory. It was first introduced in the early 1990s by Boser et al. (1992) and has contributed important applications in various domains such as pattern recognition (for example handwritten digits or images), bioinformatics, etc. This technique can be employed in different contexts such as classification, regression or density estimation, following Vapnik [1998]. Recently, different applications in the financial field have been developed along two main directions. The first one employs SVM as a non-linear estimator in order to forecast the market tendency or the volatility. In this context, SVM is used as a regression technique, with a feasible extension to the non-linear case thanks to the kernel approach. The second direction consists of using SVM as a classification technique which aims at elaborating the stock selection in a trading strategy (for example a long/short strategy). In this paper, we review the support vector machine and its applications in finance from both points of view. The literature in this recent field is quite diversified and divergent, with many approaches and different techniques. We would first like to give an overview of SVM from its basic construction to its extensions, including the multi-classification problem. We then present different numerical implementations and bridge them to financial applications.

This paper is organized as follows. In Section 2, we recall the framework of support vector machine theory based on the approach proposed in Chapelle (2002). In Section 3, we work out various implementations of this technique from both the primal and the dual problems. The extension of SVM to the multi-classification case is discussed in Section 4. We finish with the introduction of SVM in the financial domain via an example of stock selection in Sections 5 and 6.
3.2 Support vector machine at a glance

In this section, we attempt to give an overview of the support vector machine method. In order to introduce the basic idea of SVM, we start with a first discussion of the classification method via the concepts of hard-margin and soft-margin classification. As the pioneering work of Vapnik and Chervonenkis (1971) established a framework for Statistical Learning Theory, the so-called VC theory, we give a brief introduction with the basic notation and the important Vapnik-Chervonenkis theorem for the Empirical Risk Minimization (ERM) principle. The extension of ERM to Vicinal Risk Minimization (VRM) will also be discussed.
3.2.1 Basic ideas of SVM

We illustrate here the basic ideas of SVM as a classification method. The main advantage of SVM is that it can not only be described very intuitively in the context of linear classification, but can also be extended in an intelligent way to the non-linear case. Let us define the training data set consisting of pairs of input/output points $(x_i, y_i)$, with $1 \leq i \leq n$. Here the input vector $x_i$ belongs to some space $\mathcal{X}$, whereas the output $y_i$ belongs to $\{-1, 1\}$ in the case of bi-classification. The output $y_i$ is used to identify the two possible classes.

Hard margin classification

The simplest idea of linear classification is to look at the whole set of inputs $x_i \in \mathcal{X}$ and search for a hyperplane which can separate the data into two classes according to the labels $y_i = \pm 1$. It consists of constructing a linear discriminant function of the form:

$$h(x) = w^T x + b$$

where the vector $w$ is the weight vector and $b$ is called the bias. The hyperplane is defined by the following equation:

$$\mathcal{H} = \{ x : h(x) = w^T x + b = 0 \}$$

This hyperplane divides the space $\mathcal{X}$ into two regions: the region where the discriminant function takes positive values and the region where it takes negative values. The hyperplane is also called the decision boundary. The classification is called linear because this boundary depends on the data in a linear way.
Figure 3.1: Geometric interpretation of the margin in a linear SVM.
We now define the notion of margin. In Figure 3.1 (reprinted from Ben-Hur A. et al., 2010), we give a geometric interpretation of the margin in a linear SVM. Let $x_+$ and $x_-$ be the closest points to the hyperplane on the positive and negative sides. The circled data points are the support vectors, i.e. the points closest to the decision boundary (see Figure 3.1). The vector $w$ is the normal vector to the hyperplane; we denote its norm $\|w\| = \sqrt{w^T w}$ and its direction $\hat{w} = w/\|w\|$. We assume that $x_+$ and $x_-$ are equidistant from the decision boundary. They determine the margin by which the two classes of points of the data set $\mathcal{D}$ are separated:

$$m_{\mathcal{D}}(h) = \frac{1}{2}\, \hat{w}^T (x_+ - x_-)$$

In geometric terms, this margin is just half the distance between the two closest points from either side of the hyperplane $\mathcal{H}$, projected onto the direction $\hat{w}$. We use the equations that define the relative positions of these points with respect to the hyperplane $\mathcal{H}$:

$$h(x_+) = w^T x_+ + b = a$$
$$h(x_-) = w^T x_- + b = -a$$

where $a > 0$ is some constant. As the normal vector $w$ and the bias $b$ are determined only up to a scale factor, we can simply divide them by $a$ and renormalize these equations. This is equivalent to setting $a = 1$ in the above expressions, and we finally get:

$$m_{\mathcal{D}}(h) = \frac{1}{2}\, \hat{w}^T (x_+ - x_-) = \frac{1}{\|w\|}$$
The basic idea of the maximum-margin classifier is to determine the hyperplane which maximizes the margin. For a separable dataset, we can define the hard-margin SVM as the following optimization problem:

$$\min_{w, b} \; \frac{1}{2} \|w\|^2 \qquad (3.1)$$
$$\text{u.c.} \quad y_i \left( w^T x_i + b \right) \geq 1 \qquad i = 1, \ldots, n$$

Here, $y_i \left( w^T x_i + b \right) \geq 1$ is just a compact way to express the relative position of the two classes of data points with respect to the hyperplane $\mathcal{H}$. In fact, we have $w^T x_i + b \geq 1$ for the class $y_i = 1$ and $w^T x_i + b \leq -1$ for the class $y_i = -1$.
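As a toy illustration of problem (3.1), one can solve it numerically on a small separable dataset. The sketch below uses scikit-learn's SVC with a linear kernel and a very large C, which approximates the hard-margin solution; this is our shortcut, not the historical dual derivation discussed next:

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-2.0, scale=0.5, size=(20, 2)),
               rng.normal(loc=+2.0, scale=0.5, size=(20, 2))])
y = np.r_[-np.ones(20), np.ones(20)]

clf = SVC(kernel="linear", C=1e6)       # very large C -> (almost) hard margin
clf.fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]
print("margin =", 1.0 / np.linalg.norm(w))
print("support vectors:", clf.support_vectors_)
```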
The historical approach to solving this quadratic program is to map the primal problem to the dual problem. We give here the main result, while the detailed derivation can be found in Appendix C.1. Via the KKT theorem, this approach gives us the following optimal solution $(w^*, b^*)$:

$$w^* = \sum_{i=1}^{n} \alpha_i^* y_i x_i$$

where $\alpha^* = (\alpha_1^*, \ldots, \alpha_n^*)$ is the solution of the dual optimization problem with dual variable $\alpha = (\alpha_1, \ldots, \alpha_n)$ of dimension $n$:

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j\, x_i^T x_j$$
$$\text{u.c.} \quad \alpha_i \geq 0 \qquad i = 1, \ldots, n$$
We remark that the above optimization problem is a quadratic program in the vector space $\mathbb{R}^d$ with $n$ linear inequality constraints. It may become meaningless if it has no solution (the dataset is inseparable) or too many solutions (instability of the decision boundary with respect to the data). The questions of the existence of a solution of Problem (3.5), or of the sensitivity of the solution to the dataset, are very difficult. A quantitative characterization can be found in the next discussion on the framework of Vapnik-Chervonenkis theory. We present here an intuitive view of this problem, which depends on two main factors. The first one is the dimension of the space of functions $h(x)$ which determine the decision boundary. In the linear case, it is simply determined by the dimension of the couple $(w, b)$. If the dimension of this function space is too small, as in the linear case, it is possible that no linear solution exists, i.e. the dataset cannot be separated by a simple linear classifier. The second factor is the number of data points, which enters the optimization program via the $n$ inequality constraints. If the number of constraints is too large, the solution may not exist either. In order to overcome this problem, we must increase the dimension of the optimization problem. There are two possible ways to do this. The first one consists of relaxing the inequality constraints by introducing additional variables which tolerate a violation of the strict separation: we allow the separation to be made with a certain error (some data points on the wrong side). This technique was first introduced by Cortes C. and Vapnik V. (1995) under the name soft-margin SVM. The second one consists of using a non-linear classifier, which directly extends the function space to a higher dimension. The use of a non-linear classifier can rapidly increase the dimension of the optimization problem, which creates a computational problem. An intelligent way to overcome this is to employ the notion of kernel. In the next discussions, we will try to clarify these two approaches, then finish this section by introducing the two general frameworks of this learning theory.
Soft margin classification

In fact, the inequality constraints described above, $y_i \left( w^T x_i + b \right) \geq 1$, ensure that all data points are well classified with respect to the optimal hyperplane. As the data may be inseparable, an intuitive way to overcome this is to relax the strict constraints by introducing additional variables $\xi_i$, $i = 1, \ldots, n$, the so-called slack variables. They allow a certain error to be committed in the classification via the new constraints:

$$y_i \left( w^T x_i + b \right) \geq 1 - \xi_i, \quad i = 1 \ldots n \qquad (3.2)$$

For $\xi_i > 1$, the data point $x_i$ is completely misclassified, whereas $0 \leq \xi_i \leq 1$ can be interpreted as a margin error. With this definition of the slack variables, $\sum_{i=1}^{n} \xi_i$ is directly related to the number of misclassified points. In order to fix our expected error in the classification problem, we introduce an additional term $C \sum_{i=1}^{n} \xi_i^p$ in the objective function and rewrite the optimization problem as follows:
$$\min_{w,b,\xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \qquad (3.3)$$

$$\text{u.c.} \quad y_i \left( w^T x_i + b \right) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1 \ldots n$$
Here, $C$ is the parameter used to fix our desired level of error and $p \geq 1$ is the usual way to ensure the convexity of the additional term$^1$. The soft-margin solution of the SVM problem can be interpreted as a regularization technique, of the kind one finds in different optimization problems such as regression, filtering or matrix inversion. The same result is recovered with the regularization technique later, when we discuss the possible use of kernels.

Before switching to the next discussion on non-linear classification with the kernel approach, we remark that the soft margin SVM problem now lives in a higher dimension $d + 1 + n$. However, the computation cost is not increased. Thanks to the KKT theorem, we can turn this primal problem into a dual problem with simpler constraints. We can also work directly with the primal problem by performing a trivial optimization on $\xi$. The primal problem is then no longer a quadratic program; however, it can be solved by Newton optimization or conjugate gradient, as demonstrated in Chapelle O. (2007).
$^1$ It is equivalent to defining an $L_p$ norm on the slack vector $\xi \in \mathbb{R}^n$.
Non-linear classification, kernel approach

The second approach to improve the classification is to employ the non-linear SVM. In the context of SVM, we would like to insist that the construction of a non-linear discriminant function $h(x)$ consists of two steps. We first extend the data space $\mathcal{X}$ of dimension $d$ to a feature space $\mathcal{F}$ of higher dimension $N$ via a non-linear transformation $\phi : \mathcal{X} \rightarrow \mathcal{F}$; then a hyperplane is constructed in the feature space $\mathcal{F}$ as presented before:

$$h(x) = w^T \phi(x) + b$$

Here, the resulting vector $z = (z_1, \ldots, z_N) = \phi(x)$ is an $N$-component vector of the space $\mathcal{F}$, hence $w$ is also a vector of size $N$. The hyperplane $\mathcal{H} = \{ z : w^T z + b = 0 \}$ defined in $\mathcal{F}$ is no longer a linear decision boundary in the initial space $\mathcal{X}$:

$$\mathcal{B} = \{ x : w^T \phi(x) + b = 0 \}$$
At this stage, the generalization to the non-linear case helps us avoid the problems of overfitting or underfitting. However, a computational problem emerges due to the high dimension of the feature space. For example, if we consider a quadratic transformation, it leads to a feature space of dimension $N = d(d+3)/2$. The main question is how to construct the separating hyperplane in the feature space. The answer is to employ the mapping to the dual problem. In this way, our $N$-dimensional problem turns again into the following $n$-dimensional optimization problem with dual variable $\alpha$:

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \, \phi(x_i)^T \phi(x_j)$$

$$\text{u.c.} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \quad \alpha_i \geq 0, \quad i = 1 \ldots n$$
Indeed, the expansion of the optimal solution $w^*$ has the following form:

$$w^* = \sum_{i=1}^{n} \alpha_i^* y_i \, \phi(x_i)$$
In order to solve the quadratic program, we do not need the explicit form of the non-linear map but only the kernel $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$, which is usually supposed to be symmetric. Providing only the kernel $K(x_i, x_j)$ to the optimization problem is enough to construct later the hyperplane $\mathcal{H}$ in the feature space $\mathcal{F}$ or the decision boundary in the data space $\mathcal{X}$. The discriminant function can be computed as follows, thanks to the expansion of the optimal $w^*$ on the initial data $x_i$, $i = 1, \ldots, n$:

$$h(x) = \sum_{i=1}^{n} \alpha_i^* y_i K(x, x_i) + b^*$$

From this expression, we can construct the decision function used to classify a given input $x$ as $f(x) = \mathrm{sign}(h(x))$.
For a given non-linear function $\phi(x)$, we can compute the kernel $K(x_i, x_j)$ via the scalar product of two vectors in the space $\mathcal{F}$. However, the converse does not hold unless the kernel satisfies the condition of Mercer's theorem (1909). Here, we list some standard kernels which are already widely used in the pattern recognition domain:

i. Polynomial kernel: $K(x, y) = \left( x^T y + 1 \right)^p$

ii. Radial Basis kernel: $K(x, y) = \exp\left( -\|x - y\|^2 / 2\sigma^2 \right)$

iii. Neural Network kernel: $K(x, y) = \tanh\left( a \, x^T y - b \right)$
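As an illustration, a minimal Python sketch of these three kernels is given below. This code is not part of the original study; the hyperparameters p, sigma, a and b are arbitrary choices of the user.

import numpy as np

def polynomial_kernel(x, y, p=2):
    # (x^T y + 1)^p
    return (np.dot(x, y) + 1.0) ** p

def rbf_kernel(x, y, sigma=1.0):
    # exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, y, a=1.0, b=0.0):
    # the "neural network" kernel; note it is not positive definite for every (a, b)
    return np.tanh(a * np.dot(x, y) - b)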
3.2.2 ERM and VRM frameworks

We finish this review on SVM by briefly discussing the general framework of Statistical Learning Theory, which includes the SVM. Without entering into details such as the important theorem of Vapnik-Chervonenkis (1998), we would like to give a more general view of the SVM by answering questions such as how to approach SVM as a regression, or how to interpret the soft-margin SVM as a regularization technique.

Empirical Risk Minimization framework

The Empirical Risk Minimization framework was studied by Vapnik and Chervonenkis in the 70s. In order to present the main idea, we first fix some notations. Let $(x_i, y_i)$, $1 \leq i \leq n$, be the training dataset of input/output pairs. The dataset is supposed to be generated i.i.d. from an unknown distribution $P(x, y)$. The dependency between the input $x$ and the output $y$ is characterized by this distribution. For example, if the input $x$ has a distribution $P(x)$ and the output is related to $x$ via a function $y = f(x)$ altered by a Gaussian noise $\mathcal{N}(0, \sigma^2)$, then $P(x, y)$ reads:

$$P(x, y) = P(x) \, \mathcal{N}\left( f(x) - y, \sigma^2 \right)$$

We remark in this example that if $\sigma \rightarrow 0$ then $\mathcal{N}(0, \sigma^2)$ tends to a Dirac distribution, which means that the relation between input and output can be exactly determined by the position of the maximum of the distribution $P(x, y)$. Estimating the function $f(x)$ is fundamental. In order to measure the estimation quality, we compute the expected value of a loss function with respect to the distribution $P(x, y)$.
We define here the loss function in two different contexts:

1. Classification: $l(f(x), y) = \mathbb{I}_{\{f(x) \neq y\}}$ where $\mathbb{I}$ is the indicator function.

2. Regression: $l(f(x), y) = (f(x) - y)^2$

The objective of statistical learning is to determine the function $f$, in a certain function space $\mathcal{F}$, which minimizes the expected loss, or risk, objective function:

$$R(f) = \int l(f(x), y) \, \mathrm{d}P(x, y)$$
As the distribution $P(x, y)$ is unknown, the expected loss cannot be evaluated. However, with the available training dataset $\{x_i, y_i\}$, one can compute the empirical risk as follows:

$$R_{\mathrm{emp}}(f) = \frac{1}{n} \sum_{i=1}^{n} l(f(x_i), y_i)$$

In the limit of a large dataset $n \rightarrow \infty$, we expect the convergence $R_{\mathrm{emp}}(f) \rightarrow R(f)$ for every tested function $f$, thanks to the law of large numbers. However, is the learning function $f$ which minimizes $R_{\mathrm{emp}}(f)$ the one minimizing the true risk $R(f)$? The answer to this question is no. In general, there is an infinite number of functions $f$ which can learn the training dataset perfectly, $f(x_i) = y_i \; \forall i$. In fact, we have to restrict the function space $\mathcal{F}$ in order to ensure the uniform convergence of the empirical risk to the true risk. The characterization of the complexity of a space of functions $\mathcal{F}$ was first studied in the VC theory via the concept of VC dimension (1971) and the important VC theorem, which gives an upper bound on the probability $\mathrm{Pr}\left( \sup_{f \in \mathcal{F}} |R(f) - R_{\mathrm{emp}}(f)| > \epsilon \right)$.

A common way to restrict the function space is to impose a regularization condition. We denote by $\Omega(f)$ a measure of regularity; the regularized problem then consists of minimizing the regularized risk:

$$R_{\mathrm{reg}}(f) = R_{\mathrm{emp}}(f) + \lambda \, \Omega(f)$$

Here $\lambda$ is the regularization parameter and $\Omega(f)$ can be, for example, an $L_p$ norm on some deviation of $f$.
Vapnik and Chervonenkis theory

We are not going to discuss the VC theory of statistical learning machines in detail, but only recall the most important result concerning the characterization of the complexity of a function class. In order to quantify the trade-off between the overfitting problem and the inseparable-data problem, Vapnik and Chervonenkis introduced a very important concept, the VC dimension, together with an important theorem which characterizes the convergence of the empirical risk function. First, the VC dimension is introduced to measure the complexity of the class of functions $\mathcal{F}$.

Definition 3.2.1 The VC dimension of a class of functions $\mathcal{F}$ is defined as the maximum number of points that can be exactly learned by a function of $\mathcal{F}$:

$$h = \max \left\{ |X| : X \subset \mathcal{X} \; \text{ such that } \; \forall b \in \{-1, 1\}^{|X|}, \; \exists f \in \mathcal{F}, \; \forall x_i \in X, \; f(x_i) = b_i \right\} \qquad (3.4)$$
With the definition of the VC dimension, we now present the VC theorems, which are very powerful tools to control the upper limit of the convergence of the empirical risk to the true risk function. These theorems give us a clear idea of the upper bound attainable given the available information and the number of observations $n$ in the training set. By satisfying this theorem, we can control the trade-off between overfitting and underfitting. The relation between the number of factors, or coordinates of the vector $x$, and the VC dimension is given in the following theorem:
Theorem 3.2.2 (VC theorem of hyperplanes) Let $\mathcal{F}$ be the set of hyperplanes in $\mathbb{R}^d$:

$$\mathcal{F} = \left\{ x \mapsto \mathrm{sign}\left( w^T x + b \right), \; w \in \mathbb{R}^d, \; b \in \mathbb{R} \right\}$$

then its VC dimension is $d + 1$.

This theorem gives the explicit relation between the VC dimension and the number of factors, i.e. the number of coordinates of the input vectors of the training set. It can be used with the next theorem in order to evaluate the information necessary for a good classification or regression.
Theorem 3.2.3 (Vapnik and Chervonenkis) Let $\mathcal{F}$ be a class of functions of VC dimension $h$; then for any distribution $\mathrm{Pr}$ and for any sample $(x_i, y_i)_{i=1 \ldots n}$ drawn from this distribution, the following inequality holds:

$$\mathrm{Pr}\left( \sup_{f \in \mathcal{F}} \left| R(f) - R_{\mathrm{emp}}(f) \right| > \epsilon \right) < 4 \exp\left( \left[ \frac{h \left( 1 + \ln \frac{2n}{h} \right)}{n} - \left( \epsilon - \frac{1}{n} \right)^2 \right] n \right)$$
An important corollary of the VC theorem is the upper bound for the convergence of the empirical risk function to the risk function:

Corollary 3.2.4 Under the same hypotheses as the VC theorem, the following inequality holds with probability $1 - \eta$:

$$\forall f \in \mathcal{F}, \quad R(f) \leq R_{\mathrm{emp}}(f) + \sqrt{ \frac{ h \left( \ln \frac{2n}{h} + 1 \right) - \ln \frac{\eta}{4} }{n} } + \frac{1}{n}$$

We skip the proofs of these theorems and postpone the discussion on the practical importance of the VC theorems to Section 6, as the overfitting and underfitting problems are very present in any financial application.
Vicinal Risk Minimization framework

The Vicinal Risk Minimization (VRM) framework was formally developed in the work of Chapelle O. (2000s). In the ERM framework, the risk is evaluated by using the empirical probability distribution:

$$\mathrm{d}P_{\mathrm{emp}}(x, y) = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}(x) \, \delta_{y_i}(y)$$

where $\delta_{x_i}(x)$, $\delta_{y_i}(y)$ are Dirac distributions located at $x_i$ and $y_i$ respectively. In the VRM framework, instead of $\mathrm{d}P_{\mathrm{emp}}$, the Dirac distribution is replaced by an estimated density in the vicinity of $x_i$:

$$\mathrm{d}P_{\mathrm{vic}}(x, y) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{d}P_{x_i}(x) \, \delta_{y_i}(y)$$
Hence, the vicinal risk is defined as follows:

$$R_{\mathrm{vic}}(f) = \int l(f(x), y) \, \mathrm{d}P_{\mathrm{vic}}(x, y) = \frac{1}{n} \sum_{i=1}^{n} \int l(f(x), y_i) \, \mathrm{d}P_{x_i}(x)$$
In order to illustrate the difference between the ERM framework and the VRM framework, let us consider the following example of linear regression. In this case, our loss function is $l(f(x), y) = (f(x) - y)^2$, where the learning function is of the form $f(x) = w^T x + b$. Assume that the vicinal probability density $\mathrm{d}P_{x_i}(x)$ is approximated by a white noise of variance $\sigma^2$ around $x_i$. The vicinal risk is then calculated as follows:

$$R_{\mathrm{vic}}(f) = \frac{1}{n} \sum_{i=1}^{n} \int \left( f(x) - y_i \right)^2 \mathrm{d}P_{x_i}(x) = \frac{1}{n} \sum_{i=1}^{n} \int \left( f(x_i + \epsilon) - y_i \right)^2 \mathrm{d}\mathcal{N}(0, \sigma^2) = \frac{1}{n} \sum_{i=1}^{n} \left( f(x_i) - y_i \right)^2 + \sigma^2 \|w\|^2$$

This is equivalent to the regularized risk minimization problem $R_{\mathrm{vic}}(f) = R_{\mathrm{emp}}(f) + \sigma^2 \|w\|^2$, with parameter $\lambda = \sigma^2$ and an $L_2$ penalty constraint.
3.3 Numerical implementations

In this section, we discuss explicitly the two possible ways to implement the SVM algorithm. As discussed above, the kernel approach can be applied directly to the dual problem, and it leads to the simple form of a quadratic program. We discuss first the dual approach, for historical reasons. Direct implementation of the primal problem is a little more delicate, which is why it was implemented much later by Chapelle O. (2007) using the Newton optimization method and the conjugate gradient method. According to Chapelle O., in terms of complexity both approaches offer more or less the same efficiency, while in some contexts the latter gives some advantage in the precision of the solution.

3.3.1 Dual approach

We discuss here in more detail the two main applications of SVM, namely the classification problem and the regression problem, within the dual approach. The reason for the historical choice of this approach is simply that it offers the possibility to obtain a standard quadratic program whose numerical implementation is well established. Here, we summarize the results presented in Cortes C. and Vapnik V. (1995), where the notion of soft-margin SVM was introduced. We next discuss the extension to regression.
Classification problem

As introduced in the last section, classification encounters two main problems: overfitting and underfitting. If the dimension of the function space is too large, the result will be very sensitive to the input, so that a small change in the data can cause an instability in the final result. The second problem arises with non-separable data, in the sense that the function space is too small and we cannot obtain a solution which minimizes the risk function. In both cases, a regularization scheme is necessary to make the problem well-posed. In the first case, one should restrict the function space by imposing some condition and working with a specific function class (the linear case for example). In the latter case, one needs to extend the function space by introducing some tolerable error (soft-margin approach) or by working with a non-linear transformation.
a) Linear SVM with soft-margin approach

In the work of Cortes C. and Vapnik V. (1995), the notion of soft margin was first introduced by accepting that there will be some error in the classification. They characterize this error by additional variables $\xi_i$ associated to each data point $x_i$. These parameters enter the classification via the constraints. For a given hyperplane, the constraint $y_i \left( w^T x_i + b \right) \geq 1$ means that the point $x_i$ is well classified and lies outside the margin. When we change this condition to $y_i \left( w^T x_i + b \right) \geq 1 - \xi_i$ with $\xi_i \geq 0$, $i = 1 \ldots n$, it first allows the point $x_i$ to be well classified but inside the margin for $0 \leq \xi_i < 1$. For values $\xi_i > 1$, there is a possibility that the input $x_i$ is misclassified. As written above, the primal problem becomes an optimization with respect to the margin and the total committed error:

$$\min_{w,b,\xi} \; \frac{1}{2}\|w\|^2 + C \cdot F\left( \sum_{i=1}^{n} \xi_i^p \right)$$

$$\text{u.c.} \quad y_i \left( w^T x_i + b \right) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1 \ldots n$$
Here, $p$ is the degree of regularization. We remark that only for the choice $p \geq 1$ can the soft margin have a unique solution. The function $F(u)$ is usually chosen as a convex function with $F(0) = 0$, for example $F(u) = u^k$. In the following we consider two specific cases: (i) the hard-margin limit, where no slack is allowed; (ii) the $L_1$ penalty with $F(u) = u$, $p = 1$. We define the dual vector $\alpha = (\alpha_1, \ldots, \alpha_n)$ and the output vector $y = (y_1, \ldots, y_n)$. In order to write the optimization problem in vector form, we also define the operator $D = (D_{ij})_{n \times n}$ with $D_{ij} = y_i y_j \, x_i^T x_j$.
i. Hard-margin limit. As shown in Appendix C.1.1, this problem can be mapped to the following dual problem:

$$\max_{\alpha} \; \alpha^T \mathbf{1} - \frac{1}{2} \alpha^T D \, \alpha \qquad (3.5)$$

$$\text{u.c.} \quad \alpha^T y = 0, \quad \alpha \geq 0$$
ii. $L_1$ penalty with $F(u) = u$, $p = 1$. In this case the associated dual problem is given by:

$$\max_{\alpha} \; \alpha^T \mathbf{1} - \frac{1}{2} \alpha^T D \, \alpha \qquad (3.6)$$

$$\text{u.c.} \quad \alpha^T y = 0, \quad \mathbf{0} \leq \alpha \leq C \mathbf{1}$$

The full derivation is given in Appendix C.1.2.
Remark 2 For the case with $L_2$ penalty ($F(u) = u$, $p = 2$), we will demonstrate in the next discussion that it is a special case of the kernel approach for the hard-margin case. Hence, the dual problem is written exactly as in the hard-margin case, with an additional regularization term $1/2C$ added to the matrix $D$:

$$\max_{\alpha} \; \alpha^T \mathbf{1} - \frac{1}{2} \alpha^T \left( D + \frac{1}{2C} I \right) \alpha \qquad (3.7)$$

$$\text{u.c.} \quad \alpha^T y = 0, \quad \alpha \geq 0$$
b) Non-linear SVM with kernel approach

The second possibility to extend the function space is to employ a non-linear transformation $\phi(x)$ from the initial space $\mathcal{X}$ to the feature space $\mathcal{F}$ and then construct the hard-margin problem. This approach leads to the dual problems above with the use of an explicit kernel $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ instead of $x_i^T x_j$. In this case, the operator $D = (D_{ij})_{n \times n}$ is the matrix with elements:

$$D_{ij} = y_i y_j \, K(x_i, x_j)$$

With this convention, the first two quadratic programs above can be rewritten in the context of non-linear classification by replacing the operator $D$ by this new definition with the kernel.
We finally remark that the case of soft-margin SVM with quadratic penalty ($F(u) = u$, $p = 2$) can also be seen as a hard-margin SVM with a modified kernel. We introduce a new transformation

$$\tilde{\phi}(x_i) = \left( \phi(x_i), \; 0, \ldots, \; y_i / \sqrt{2C}, \ldots, 0 \right)$$

where the element $y_i / \sqrt{2C}$ is at position $i + \dim(\phi(x_i))$, and the new vector $\tilde{w} = \left( w, \; \xi_1 \sqrt{2C}, \ldots, \xi_n \sqrt{2C} \right)$. In this new representation, the objective function $\|w\|^2/2 + C \sum_{i=1}^{n} \xi_i^2$ becomes simply $\|\tilde{w}\|^2/2$, whereas the inequality constraint $y_i \left( w^T \phi(x_i) + b \right) \geq 1 - \xi_i$ becomes $y_i \left( \tilde{w}^T \tilde{\phi}(x_i) + b \right) \geq 1$. Hence, we obtain a hard-margin SVM with a modified kernel which can be computed simply as:

$$\tilde{K}(x_i, x_j) = \tilde{\phi}(x_i)^T \tilde{\phi}(x_j) = K(x_i, x_j) + \frac{\delta_{ij}}{2C}$$

This kernel is consistent with the QP program in the last remark.
In summary, the linear SVM is nothing else than a special case of the non-linear SVM within the kernel approach. In the following, we study the SVM problem only for the two cases with hard and soft margin within the kernel approach. After obtaining the optimal vector $\alpha^*$ by solving the associated QP program described above, we can compute $b^*$ from the KKT conditions and then derive the decision function $f(x)$. We recall that $w^* = \sum_{i=1}^{n} \alpha_i^* y_i \, \phi(x_i)$.
i. For the hard-margin case, the KKT condition given in Appendix C.1.1 is:

$$\alpha_i^* \left[ y_i \left( w^{*T} \phi(x_i) + b^* \right) - 1 \right] = 0$$

We notice that for values $\alpha_i^* > 0$, the inequality constraint becomes an equality. As these points turn the inequality constraint into an equality, they are the closest points to the optimal frontier and they are called support vectors. Hence, $b^*$ can be computed easily from any given support vector $(x_i, y_i)$ as:

$$b^* = y_i - w^{*T} \phi(x_i)$$

In order to enhance the precision of $b^*$, we evaluate this value as the average over the set $SV$ of support vectors:

$$b^* = \frac{1}{n_{SV}} \sum_{i \in SV} \left( y_i - \sum_{j \in SV} \alpha_j^* y_j \, \phi(x_j)^T \phi(x_i) \right) = \frac{1}{n_{SV}} \sum_{i \in SV} \left( y_i - \sum_{j \in SV} \alpha_j^* y_j \, K(x_i, x_j) \right)$$
ii. For the soft-margin case, the KKT condition given in Appendix C.1.2 is slightly different:

$$\alpha_i^* \left[ y_i \left( w^{*T} \phi(x_i) + b^* \right) - 1 + \xi_i \right] = 0$$

However, if $\alpha_i$ satisfies the condition $0 < \alpha_i < C$, then we can show that $\xi_i = 0$. The condition $0 < \alpha_i < C$ defines the subset of training points (support vectors) which are closest to the separating frontier. Hence, $b^*$ can be computed by exactly the same expression as in the hard-margin case.
From the optimal values of the triple $(\alpha^*, w^*, b^*)$, we can construct the decision function used to classify a given input $x$ as follows:

$$f(x) = \mathrm{sign}\left( \sum_{i=1}^{n} \alpha_i^* y_i \, K(x, x_i) + b^* \right) \qquad (3.8)$$
Regression problem

In the last sections, we have discussed the SVM problem only in the classification context. In this section, we show how the regression problem can be interpreted as an SVM problem. As discussed in the general frameworks of statistical learning (ERM or VRM), the SVM problem consists of minimizing the risk function $R_{\mathrm{emp}}$ or $R_{\mathrm{vic}}$. The risk function can be computed via the loss function $l(f(x), y)$, which defines our objective (classification or regression). Explicitly, the risk function is calculated as:

$$R(f) = \int l(f(x), y) \, \mathrm{d}P(x, y)$$

where the distribution $\mathrm{d}P(x, y)$ can be computed in the ERM framework or in the VRM framework. For the classification problem, the loss function is defined as $l(f(x), y) = \mathbb{I}_{\{f(x) \neq y\}}$, which means that we count an error whenever the given point is misclassified. The minimization of the risk function for classification can then be mapped to the maximization of the margin $1/\|w\|$. For the regression problem, the loss function is $l(f(x), y) = (f(x) - y)^2$, which means that the loss is the regression error.
Remark 3 We have chosen the least-square error as the loss just for illustration. In general, it can be replaced by any positive function $F$ of $f(x) - y$. Hence, we have the loss function in the general form $l(f(x), y) = F(f(x) - y)$. We remark that the least-square case corresponds to the $L_2$ norm, so the most simple generalization is to take the loss function as an $L_p$ norm, $l(f(x), y) = |f(x) - y|^p$. We show later that the special case with $L_1$ brings the regression problem to a form similar to the soft-margin classification.
In the last discussion on classification, we concluded that the linear SVM problem is just a special case of the non-linear SVM within the kernel approach. Hence, we work here directly with the non-linear case, where the training vector $x$ is already transformed by a non-linear map $\phi(x)$. Therefore, the approximating function of the regression reads $f(x) = w^T \phi(x) + b$. In the ERM framework, the risk function is estimated simply as the empirical sum over the dataset:

$$R_{\mathrm{emp}} = \frac{1}{n} \sum_{i=1}^{n} \left( f(x_i) - y_i \right)^2$$

whereas in the VRM framework, if we assume that $\mathrm{d}P_{x_i}(x)$ is a Gaussian noise of variance $\sigma^2$, then the risk function reads:

$$R_{\mathrm{vic}} = \frac{1}{n} \sum_{i=1}^{n} \left( f(x_i) - y_i \right)^2 + \sigma^2 \|w\|^2$$

The risk function in the VRM framework can be interpreted as a regularized form of the risk function in the ERM framework. We rewrite the risk function after renormalizing it by the factor $2\sigma^2$:
$$R_{\mathrm{vic}} = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i^2$$

with $C = 1/(2\sigma^2 n)$. Here, we have introduced new variables $\xi = (\xi_i)_{i=1 \ldots n}$ which satisfy $y_i = f(x_i) + \xi_i = w^T \phi(x_i) + b + \xi_i$. The regression problem can now be written as a QP program with equality constraints as follows:

$$\min_{w,b,\xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i^2$$

$$\text{u.c.} \quad y_i = w^T \phi(x_i) + b + \xi_i, \quad i = 1 \ldots n$$
In this form, the regression looks very similar to the SVM classification problem. We notice that the regression problem in the context of SVM can be generalized in two possible ways:

The first way is to introduce a more general loss function $F(f(x_i) - y_i)$ instead of the least-square loss. This generalization can lead to other types of regression, such as the $\epsilon$-SV regression proposed by Vapnik (1998).

The second way is to introduce a weight distribution $\omega_i$ for the empirical distribution instead of the uniform one:

$$\mathrm{d}P_{\mathrm{emp}}(x, y) = \sum_{i=1}^{n} \omega_i \, \delta_{x_i}(x) \, \delta_{y_i}(y)$$

As financial quantities depend more on the recent past, an asymmetric weight distribution in favor of recent data would improve the estimator. The idea of this generalization is quite similar to an exponential moving average. By doing this, we recover the results obtained in Gestel T.V. et al. (2001) and in Tay F.E.H. and Cao L.J. (2002) for the LS-SVM formalism. For example, we can choose the weight distribution as proposed in Tay F.E.H. and Cao L.J. (2002): $\omega_i = 2i/(n(n+1))$ (linear weights) or $\omega_i = \left( 1 + \exp(a - 2ai/n) \right)^{-1}$ (exponential weights).
Our least-square regression problem can again be mapped to a dual problem after introducing the Lagrangian. Detailed calculations are given in Appendix C.1. We give here the principal result, which again invokes the kernel $K_{ij} = K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ for treating the non-linearity. As in the classification case, we consider only two problems, which are similar to the hard margin and the soft margin in the context of regression.

i. Least-square SVM regression: In fact, the regression problem discussed above is similar to the hard-margin problem. Here, we have to keep the regularization parameter $C$ as it defines a tolerance error for the regression. However, this problem with the $L_2$ constraint is equivalent to a hard margin with a modified kernel. The quadratic optimization program is given as follows:
$$\max_{\alpha} \; \alpha^T y - \frac{1}{2} \alpha^T \left( K + \frac{1}{2C} I \right) \alpha \qquad (3.9)$$

$$\text{u.c.} \quad \alpha^T \mathbf{1} = 0$$
ii. $\epsilon$-SVM regression: The $\epsilon$-SVM regression problem was introduced by Vapnik (1998) in order to have a formalism similar to the soft-margin SVM. He proposed to employ the loss function in the following form:

$$l(f(x), y) = \left( |y - f(x)| - \epsilon \right) \mathbb{I}_{\{|y - f(x)| \geq \epsilon\}}$$

The $\epsilon$-SVM loss function is just a generalization of the $L_1$ error. Here, $\epsilon$ is an additional tolerance parameter which allows us not to count regression errors smaller than $\epsilon$. Inserting this loss function into the expression of the risk function, we obtain the objective of the optimization problem:

$$R_{\mathrm{vic}} = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \left( |f(x_i) - y_i| - \epsilon \right) \mathbb{I}_{\{|y_i - f(x_i)| \geq \epsilon\}}$$
Because the two sets $\{ y_i - f(x_i) \geq \epsilon \}$ and $\{ y_i - f(x_i) \leq -\epsilon \}$ are disjoint, we can break the indicator function $\mathbb{I}_{\{|y_i - f(x_i)| \geq \epsilon\}}$ into two terms:

$$\mathbb{I}_{\{|y_i - f(x_i)| \geq \epsilon\}} = \mathbb{I}_{\{y_i - f(x_i) - \epsilon \geq 0\}} + \mathbb{I}_{\{f(x_i) - y_i \geq \epsilon\}}$$

We introduce the slack variables $\xi$ and $\xi^*$ as in the last case, which satisfy the conditions $\xi_i \geq y_i - f(x_i) - \epsilon$ and $\xi_i^* \geq f(x_i) - y_i - \epsilon$. Hence, we obtain the following optimization problem:

$$\min_{w,b,\xi,\xi^*} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \left( \xi_i + \xi_i^* \right)$$

$$\text{u.c.} \quad y_i - w^T \phi(x_i) - b \leq \epsilon + \xi_i, \quad \xi_i \geq 0, \quad i = 1 \ldots n$$
$$\qquad\;\; w^T \phi(x_i) + b - y_i \leq \epsilon + \xi_i^*, \quad \xi_i^* \geq 0, \quad i = 1 \ldots n$$
Remark 4 We remark that our approach gives exactly the same result as the traditional approach discussed in the work of Vapnik (1998), in which the objective function is constructed by minimizing $\|w\|^2$ together with additional terms defining the regression error. These terms are controlled by the couple of slack variables.
The dual problem in this case can be obtained by performing the same calculation as for the soft-margin SVM:

$$\max_{\alpha,\alpha^*} \; \left( \alpha - \alpha^* \right)^T y - \epsilon \left( \alpha + \alpha^* \right)^T \mathbf{1} - \frac{1}{2} \left( \alpha - \alpha^* \right)^T K \left( \alpha - \alpha^* \right) \qquad (3.10)$$

$$\text{u.c.} \quad \left( \alpha - \alpha^* \right)^T \mathbf{1} = 0, \quad \mathbf{0} \leq \alpha, \alpha^* \leq C \mathbf{1}$$

For the particular case with $\epsilon = 0$, we obtain:

$$\max_{\alpha} \; \alpha^T y - \frac{1}{2} \alpha^T K \alpha$$

$$\text{u.c.} \quad \alpha^T \mathbf{1} = 0, \quad |\alpha| \leq C \mathbf{1}$$
After the optimization procedure using a QP program, we obtain the optimal vector $\alpha^*$ and then compute $b^*$ from the KKT condition:

$$w^T \phi(x_i) + b - y_i = 0$$

for the support vectors $(x_i, y_i)$ (see Appendix C.1.3 for more detail). In order to obtain a good accuracy in the estimation of $b$, we average over the set $SV$ of support vectors and obtain:

$$b^* = \frac{1}{n_{SV}} \sum_{i \in SV} \left( y_i - \sum_{j=1}^{n} \alpha_j \, K(x_i, x_j) \right)$$

The SVM regressor is then given by the following formula:

$$f(x) = \sum_{i=1}^{n} \alpha_i \, K(x, x_i) + b^*$$
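For the least-square SVM regression (3.9), the KKT conditions reduce to a single linear system in $(\alpha, b)$, which suggests the following minimal sketch (an illustration only, not the implementation used for the backtests below):

import numpy as np

def ls_svm_regression(K, y, C):
    # Solve [K + I/(2C), 1; 1^T, 0] [alpha; b] = [y; 0], the KKT system of (3.9)
    n = len(y)
    M = np.zeros((n + 1, n + 1))
    M[:n, :n] = K + np.eye(n) / (2.0 * C)
    M[:n, n] = 1.0
    M[n, :n] = 1.0
    sol = np.linalg.solve(M, np.append(y, 0.0))
    return sol[:n], sol[n]          # alpha, b

def ls_svm_predict(x, X, alpha, b, kernel):
    # f(x) = sum_i alpha_i K(x, x_i) + b
    return sum(alpha[i] * kernel(x, X[i]) for i in range(len(alpha))) + b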
3.3.2 Primal approach

We now discuss the possibility of a direct implementation of the primal problem. This approach has been proposed and studied by Chapelle O. (2007). In this work, the author argues that both primal and dual implementations have the same complexity, of order $O\left( \max(n, d) \min(n, d)^2 \right)$. Indeed, according to the author, the primal problem might give a more accurate solution, as it treats directly the quantity one is interested in. This can easily be understood via the special case of an LS-SVM linear estimator, where both primal and dual problems can be solved analytically.

The main idea of the primal implementation is to rewrite the constrained optimization problem as an unconstrained one by performing a trivial minimization over the slack variables $\xi$. We then obtain:

$$\min_{w,b} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} L\left( y_i, \, w^T \phi(x_i) + b \right) \qquad (3.11)$$
Here, we have $L(y, t) = (y - t)^p$ for the regression problem, whereas $L(y, t) = \max(0, 1 - yt)^p$ for the classification problem. In the case of a quadratic loss or $L_2$ penalty, the function $L(y, t)$ is differentiable with respect to the second variable, hence one can write the zero-gradient equation. In the case where $L(y, t)$ is not differentiable, such as $L(y, t) = \max(0, 1 - yt)$, we have to approximate it by a regular function. Assuming that $L(y, t)$ is differentiable with respect to $t$, we obtain:

$$w + C \sum_{i=1}^{n} \frac{\partial L}{\partial t}\left( y_i, \, w^T \phi(x_i) + b \right) \phi(x_i) = 0$$

which leads to the following representation of the solution $w$:

$$w = \sum_{i=1}^{n} \beta_i \, \phi(x_i)$$
By introducing the kernel $K_{ij} = K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$, we rewrite the primal problem as follows:

$$\min_{\beta,b} \; \frac{1}{2} \beta^T K \beta + C \sum_{i=1}^{n} L\left( y_i, \, K_i^T \beta + b \right) \qquad (3.12)$$

where $K_i$ is the $i^{\mathrm{th}}$ column of the matrix $K$. We note that this is now an unconstrained optimization problem which can be solved by gradient descent whenever $L(y, t)$ is differentiable. In Appendix C.1, we present the detailed derivation of the primal implementation for the cases of quadratic loss and soft-margin classification.
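As an illustration of the primal approach, the following sketch minimizes the unconstrained objective (3.12) for the quadratic loss with a generic gradient-based optimizer. It is only a toy version of the Newton/conjugate-gradient implementation of Chapelle O. (2007), under the assumption that scipy is available.

import numpy as np
from scipy.optimize import minimize

def fit_primal_ls(K, y, C):
    # Minimize 0.5 beta^T K beta + C * sum (y_i - K_i^T beta - b)^2 over (beta, b)
    n = len(y)

    def objective(theta):
        beta, b = theta[:n], theta[n]
        residual = K @ beta + b - y
        return 0.5 * beta @ K @ beta + C * np.sum(residual ** 2)

    res = minimize(objective, np.zeros(n + 1), method='L-BFGS-B')
    return res.x[:n], res.x[n]      # beta, b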
3.3.3 Model selection - Cross validation procedure

The possibility to enlarge or restrict the function space gives us the possibility to obtain a solution of the SVM problem. However, the choice of the additional parameters, such as the error tolerance $C$ in the soft-margin SVM or the kernel parameter $\sigma$ in the extension to the non-linear case, is fundamental. How can we choose these parameters for a given dataset? In this section, we discuss the calibration procedure, the so-called model selection, which aims to determine the set of parameters of the SVM. This discussion is based essentially on the results presented in O. Chapelle's thesis (2002).

In order to define the calibration procedure, let us first define the test function which is used to evaluate the SVM problem. In the case where we have a lot of data, we can follow the traditional cross-validation procedure by dividing the total data into two independent sets: the training set and the validation set. The training set $\{x_i, y_i\}_{1 \leq i \leq n}$ is used for the optimization problem, whereas the validation set $\{x'_i, y'_i\}_{1 \leq i \leq m}$ is used to evaluate the error via the following test function:

$$T = \frac{1}{m} \sum_{i=1}^{m} \Psi\left( -y'_i \, f(x'_i) \right)$$

where $\Psi(x) = \mathbb{I}_{\{x > 0\}}$, with $\mathbb{I}_A$ the standard notation for the indicator function.
In the case where we do not have enough data for the SVM problem, we can employ the training set directly to evaluate the error via the leave-one-out error. Let $f^0$ be the classifier obtained with the full training set and $f^p$ the one with the point $(x_p, y_p)$ left out. The error is defined by testing the decision rule $f^p$ on the missing point $(x_p, y_p)$:

$$T = \frac{1}{n} \sum_{p=1}^{n} \Psi\left( -y_p \, f^p(x_p) \right)$$

We focus here on the first test error function, with an available validation dataset. However, the error function involves the step function, which is discontinuous and can cause some difficulty if we want to determine the best parameter selection via the optimal test error. In order to perform the search for the minimal test error by gradient
descent, for example, we should smooth the test error by regularizing the step function as:

$$\Psi(x) = \frac{1}{1 + \exp(-Ax + B)}$$

The choice of the parameters $A, B$ is important: if $A$ is too small, the approximation error is too large, whereas if $A$ is too large, the test error is not smooth enough for the minimization procedure.
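A minimal sketch of such a model selection step is given below: it performs a plain grid search of the couple $(C, \sigma)$ on a held-out validation set with a quadratic validation error instead of the smoothed indicator. The functions fit_model and predict are hypothetical placeholders for any of the solvers sketched above; the grids are illustrative.

import numpy as np

def grid_search(X_train, y_train, X_valid, y_valid, fit_model, predict,
                C_grid=(0.1, 1.0, 10.0), sigma_grid=(0.5, 1.0, 2.0)):
    # Exhaustive search of (C, sigma) minimizing the validation error
    best, best_err = None, np.inf
    for C in C_grid:
        for sigma in sigma_grid:
            model = fit_model(X_train, y_train, C=C, sigma=sigma)
            err = np.mean((predict(model, X_valid) - y_valid) ** 2)
            if err < best_err:
                best, best_err = (C, sigma), err
    return best, best_err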
3.4 Extension to SVM multi-classification

The single SVM classification (binary classification) discussed in the last section is very well established and has become a standard method for various applications. However, the extension to the multi-classification problem is not straightforward. This problem still remains a very active research topic in the recognition domain. In this section, we give a quick overview of this progressing field and of some practical implementations.

3.4.1 Basic idea of multi-classification

The multiclass SVM can be formulated as follows. Let $(x_i, y_i)_{i=1 \ldots n}$ be the training set of data with characteristics $x \in \mathbb{R}^d$ under the classification criterion $y$. For example, the training data belong to $m$ different classes labeled from 1 to $m$, which means that $y \in \{1, \ldots, m\}$. Our task is to determine a classification rule $F : \mathbb{R}^d \rightarrow \{1, \ldots, m\}$ based on the training set, which aims to predict to which class a test data point $x_t$ belongs by evaluating the decision rule $F(x_t)$.
Recently, many important contributions have advanced the field both in accuracy and in complexity (i.e. reduction of computation time). The extensions have been developed along two main directions. The first one consists of dividing the multi-classification problem into many binary classification problems by using a one-against-all or a one-against-one strategy. The next step is to construct the decision function in the recognition phase. The implementation of the decision for the one-against-all strategy is based on the maximum output among all binary SVMs. The outputs are usually mapped into estimated probabilities, as proposed by different authors such as Platt (1999). For the one-against-one strategy, in order to take the right decision, the Max Wins algorithm is adopted: the resulting class is the one voted for by the majority of binary classifiers. Both techniques encounter limitations in complexity and a high cost in computation time. Another improvement in the same direction, the binary decision tree (SVM-BDT), was recently proposed by Madzarov G. et al. (2009). This technique proved able to speed up the computation time. The second direction consists of generalizing the kernel concept in the SVM algorithm to a more general form. This method treats the multi-classification problem directly by writing a general form of the large-margin problem. It is then again mapped into the dual problem by incorporating the kernel concept.
Crammer K. and Singer Y. (2001) introduced an efficient algorithm which decomposes the dual problem into multiple optimization problems that can then be solved by a fixed-point algorithm.
3.4.2 Implementations of multiclass SVM

We describe here the two principal implementations of SVM for the multi-classification problem. The first one is a direct application of the binary SVM classifier; however, the recognition phase requires a careful choice of decision strategy. We next describe and implement the multiclass kernel-based SVM algorithm, which is a more elegant approach.

Remark 5 Before discussing the details of the two implementations, we remark that there exist other implementations of SVM, such as the application of Nonnegative Matrix Factorization (Poluru V. K. et al., 2009) to the binary case, obtained by rewriting the SVM problem in the NMF framework. The extension of this application to the multi-classification case would be an interesting topic for future work.
Decomposition into multiple binary SVM

The two most popular extensions of the single SVM classifier to a multiclass SVM classifier use the one-against-all strategy and the one-against-one strategy. Recently, another technique utilizing a binary decision tree was shown to require less effort in training and to be much faster in the recognition phase, with a complexity of order $O(\log_2 N)$. All these techniques directly employ the binary SVM implementation described above.
a) One-against-all strategy: In this case, we construct $m$ single SVM classifiers in order to separate the training data of each class from the rest of the classes. Let us consider the construction of the classifier separating class $k$ from the rest. We start by attributing the response $z_i = 1$ if $y_i = k$ and $z_i = -1$ for all $y_i \in \{1, \ldots, m\} \setminus \{k\}$. Applying this construction for all classes, we finally obtain the $m$ classifiers $f_1(x), \ldots, f_m(x)$. For a test data point $x$, the decision rule is obtained as the maximum of the outputs given by these $m$ classifiers:

$$y = \underset{k \in \{1, \ldots, m\}}{\mathrm{argmax}} \; f_k(x)$$
In order to avoid the error coming from comparing the outputs of different classifiers, we can map the output of each SVM into the same form of probability, as proposed by Platt (1999):

$$\widetilde{\Pr}\left( \omega_k \mid f_k(x) \right) = \frac{1}{1 + \exp\left( A_k f_k(x) + B_k \right)}$$

where $\omega_k$ is the label of the $k^{\mathrm{th}}$ class. This quantity can be interpreted as a measure of the accepting probability of the classifier $\omega_k$ for a given point $x$ with output $f_k(x)$. However, nothing guarantees that $\sum_{k=1}^{m} \widetilde{\Pr}\left( \omega_k \mid f_k(x) \right) = 1$, hence we have to renormalize this probability:

$$\widetilde{\Pr}\left( \omega_k \mid x \right) = \frac{\widetilde{\Pr}\left( \omega_k \mid f_k(x) \right)}{\sum_{j=1}^{m} \widetilde{\Pr}\left( \omega_j \mid f_j(x) \right)}$$

In order to obtain these probabilities, we have to calibrate the parameters $(A_k, B_k)$. This can be done by maximum likelihood on the training set (Platt (1999)). A small sketch of this decision rule is given after this list.
b) One-against-one strategy: Another way to employ the binary SVM classifier is to construct $N_c = m(m-1)/2$ binary classifiers which separate all couples of classes $(\omega_i, \omega_j)$. We denote the ensemble of classifiers by $\mathcal{C} = \{ f_1, \ldots, f_{N_c} \}$. In the recognition phase, we evaluate all possible outputs $f_1(x), \ldots, f_{N_c}(x)$ over $\mathcal{C}$ for a given point $x$. These outputs can be mapped to the response function of each classifier, $\mathrm{sign}\, f_k(x)$, which determines to which class the point $x$ belongs with respect to the classifier $f_k$. We denote by $N_1, \ldots, N_m$ the numbers of times that the point $x$ is classified in the classes $\omega_1, \ldots, \omega_m$ respectively. Using these responses we can construct a probability distribution $\widetilde{\Pr}\left( \omega_k \mid x \right)$ over the set of classes $\omega_k$. This probability is again used to decide the recognition of $x$.
c) Binary decision tree: Both methods above are quite easy to implement as they employ directly the binary solver. However, they both suffer from a high cost in computation time. We now discuss the technique proposed recently by Madzarov G. et al. (2009), which uses a binary decision tree strategy. Thanks to the binary tree, the technique gains both in complexity and in computation time. It needs only $m - 1$ classifiers, which do not always run on the whole training set during their construction. By construction, recognizing a test point $x$ requires only $O(\log_2 N)$ evaluations, obtained by descending the tree. Figure 3.2 illustrates how this algorithm works for classifying 7 classes.
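As announced in a), a small sketch of the one-against-all decision rule with Platt renormalization is given below. It is an illustration only; the calibrated parameters $A_k$, $B_k$ are assumed to be given.

import numpy as np

def one_vs_all_decision(f_outputs, A, B):
    # f_outputs, A, B: arrays of length m (one entry per class)
    p = 1.0 / (1.0 + np.exp(A * f_outputs + B))   # Platt probabilities
    p = p / p.sum()                               # renormalize so that they sum to 1
    return int(np.argmax(p))                      # index of the selected class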
Multiclass Kernel-based Vector Machines

A more general and elegant formalism can be obtained for multi-classification by generalizing the kernel concept. In this discussion, we follow the approach given in the work of Crammer K. and Singer Y. (2001), but with a more geometrical explanation. We show that this approach can be interpreted as a simultaneous combination of the one-against-all and one-against-one strategies.

As in the linear case, we have to define a decision function. For the binary case, $f(x) = \mathrm{sign}(h(x))$ where $h(x)$ defines the boundary (i.e. $f(x) = +1$ if $x$ is in class 1, whereas $f(x) = -1$ if $x$ is in class 2). For the multiclass case, the decision function must indicate the class index. In the work of Crammer K. et al. (2001), the decision rule $F : \mathbb{R}^d \rightarrow \{1, \ldots, m\}$ is constructed as follows:

$$F(x) = \underset{k \in \{1, \ldots, m\}}{\mathrm{argmax}} \left( W_k^T x \right)$$
Figure 3.2: Binary decision tree strategy for the multi-classification problem
Here, $W$ is the $d \times m$ weight matrix in which each column $W_k$ corresponds to a $d \times 1$ weight vector. Therefore, we can write the weight matrix as $W = (W_1 \; W_2 \; \ldots \; W_m)$. We recall that the vector $x$ is of dimension $d$. In fact, the vector $W_k$ corresponding to the $k^{\mathrm{th}}$ class can be interpreted as the normal vector of the hyperplane in the binary SVM. It characterizes the sensitivity of a given point $x$ to the $k^{\mathrm{th}}$ class. The quantity $W_k^T x$ is similar to a score that we attribute to the class $\omega_k$.
Remark 6 This construction looks quite similar to the one-against-all strategy. The main difference is that in the one-against-all strategy the vectors $W_1, \ldots, W_m$ are constructed independently, one by one, with binary SVMs, whereas within this formalism they are constructed simultaneously. We show in the following that the selection rule of this approach is more similar to the one-against-one strategy.
Remark 7 In order to have an intuitive geometric interpretation, we treat here the case of a linear classifier. However, the generalization to the non-linear case is straightforward when we replace $x_i^T x_j$ by $\phi(x_i)^T \phi(x_j)$. This step introduces the notion of kernel $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$.
By definition, $W_k$ is the vector defining the boundary which distinguishes the class $\omega_k$ from the rest. It is a normal vector to the boundary and points towards the region occupied by class $\omega_k$. Assume that we are able to separate all the data correctly with the classifier $W$. For any point $(x, y)$, when we compute the position of $x$ with respect to the two classes $\omega_y$ and $\omega_k$ for all $k \neq y$, we must find that $x$ belongs to class $\omega_y$. As $W_k$ defines the vector pointing towards class $\omega_k$, when we compare a class $\omega_y$ to a class $\omega_k$ it is natural to take $W_y - W_k$ as the vector pointing towards class $\omega_y$ but not $\omega_k$. As a consequence, $W_k - W_y$ is the vector pointing towards class $\omega_k$ but not $\omega_y$. When $x$ is well classified, we must have $\left( W_y^T - W_k^T \right) x > 0$ (i.e. the class $\omega_y$ has the best score). In order to have a margin as in the binary case, we impose strictly that $\left( W_y^T - W_k^T \right) x \geq 1$ for all $k \neq y$. This condition can be extended to all $k = 1 \ldots m$ by adding $\delta_{y,k}$ (the Kronecker symbol) as follows:

$$\left( W_y^T - W_k^T \right) x + \delta_{y,k} \geq 1$$

Therefore, solving the multi-classification problem for the training set $(x_i, y_i)_{i=1 \ldots n}$ is equivalent to finding $W$ satisfying:

$$\left( W_{y_i}^T - W_k^T \right) x_i + \delta_{y_i,k} \geq 1 \quad \forall i, k$$
We notice here that $w = W_i - W_j$ is a normal vector to the separation boundary $\mathcal{H}_w = \left\{ z \mid w^T z + b_{ij} = 0 \right\}$ between the two classes $\omega_i$ and $\omega_j$. Hence the width of the margin between two classes is, as in the binary case:

$$\rho\left( \mathcal{H}_w \right) = \frac{1}{\|w\|}$$

Maximizing the margin is equivalent to minimizing the norm $\|w\|$. Indeed, we have $\|w\|^2 = \|W_i - W_j\|^2 \leq 2\left( \|W_i\|^2 + \|W_j\|^2 \right)$. In order to maximize all the margins at the same time, it turns out that we have to minimize the $L_2$-norm of the matrix $W$:

$$\|W\|_2^2 = \sum_{i=1}^{m} \|W_i\|^2 = \sum_{i=1}^{m} \sum_{j=1}^{d} W_{ij}^2$$
Finally, we obtain the following optimization problem:

$$\min_{W} \; \frac{1}{2} \|W\|^2$$

$$\text{u.c.} \quad \left( W_{y_i}^T - W_k^T \right) x_i + \delta_{y_i,k} \geq 1, \quad i = 1 \ldots n, \; k = 1 \ldots m$$
The extension to the soft-margin case can be formulated easily by introducing slack variables $\xi_i$ corresponding to each training data point. As before, these slack variables allow the point to be classified inside the margin. The minimization problem now becomes:

$$\min_{W,\xi} \; \frac{1}{2} \|W\|^2 + C \cdot F\left( \sum_{i=1}^{n} \xi_i^p \right)$$

$$\text{u.c.} \quad \left( W_{y_i}^T - W_k^T \right) x_i + \delta_{y_i,k} \geq 1 - \xi_i, \quad \xi_i \geq 0 \quad \forall i, k$$
Remark 8 Within the ERM or VRM frameworks, we can construct the risk function via the loss function $l(x) = \mathbb{I}_{\{F(x) \neq y\}}$ for a data couple $(x, y)$. For example, in the ERM framework, we have:

$$R_{\mathrm{emp}}(W) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}_{\{F(x_i) \neq y_i\}}$$
The classification problem is now equivalent to finding the optimal matrix $W^*$ which minimizes the empirical risk function. In the binary case, we have seen that the minimization of the risk function is equivalent to maximizing the margin, i.e. minimizing $\|w\|^2$ under linear constraints. We remark that in the VRM framework, this problem can be tackled exactly as in the binary case. In order to prove the equivalence between minimizing the risk function and the large-margin principle, we look for a linear upper bound of the indicator function $\mathbb{I}_{\{F(x) \neq y\}}$. As shown in Crammer K. et al. (2001), we consider the following function:

$$g(x, y; k) = \left( W_k^T - W_y^T \right) x + 1 - \delta_{y,k}$$

In fact, we can prove that

$$\mathbb{I}_{\{F(x) \neq y\}} \leq g(x, y) = \max_{k} g(x, y; k) \quad \forall (x, y)$$

We first remark that $g(x, y; y) = \left( W_y^T - W_y^T \right) x + 1 - \delta_{y,y} = 0$, hence $g(x, y) \geq g(x, y; y) = 0$. If the point $(x_i, y_i)$ satisfies $F(x_i) = y_i$, then $W_{y_i}^T x_i = \max_k W_k^T x_i$ and $\mathbb{I}_{\{F(x) \neq y\}}(x_i) = 0$. In this case, it is obvious that $\mathbb{I}_{\{F(x) \neq y\}}(x_i) \leq g(x_i, y_i)$. If instead $F(x_i) \neq y_i$, then $W_{y_i}^T x_i < \max_k W_k^T x_i$ and $\mathbb{I}_{\{F(x) \neq y\}}(x_i) = 1$. In this case, $g(x_i, y_i) = \max_k \left( W_k^T x_i \right) - W_{y_i}^T x_i + 1 \geq 1$. Hence, we obtain again $\mathbb{I}_{\{F(x) \neq y\}}(x_i) \leq g(x_i, y_i)$. Finally, we obtain the upper bound of the risk function by the following expression:

$$R_{\mathrm{emp}}(W) \leq \frac{1}{n} \sum_{i=1}^{n} \max_{k} \left[ \left( W_k^T - W_{y_i}^T \right) x_i + 1 - \delta_{y_i,k} \right]$$
If the data is separable, then the optimal value of the risk function is zero. If one requires that the upper bound of the risk function be zero, then the $W^*$ which optimizes this bound must also be the one optimizing $R_{\mathrm{emp}}(W)$. The minimization can be expressed as:

$$\max_{k} \left[ \left( W_k^T - W_{y_i}^T \right) x_i + 1 - \delta_{y_i,k} \right] = 0 \quad \forall i$$

or, in the same form as the large-margin problem:

$$\left( W_{y_i}^T - W_k^T \right) x_i + \delta_{y_i,k} \geq 1 \quad \forall i, k$$
Following the traditional routine for solving this problem, we map it into the dual problem as in the binary classification case. The details of this mapping are given in Crammer K. and Singer Y. (2001). We summarize here their important result in the dual form, with dual variables $\eta_i$ of dimension $m$, $i = 1 \ldots n$. Define $\tau_i = \mathbf{1}_{y_i} - \eta_i$, where $\mathbf{1}_{y_i}$ is the column vector of zeros except for a one at the $y_i^{\mathrm{th}}$ element; then in the case of soft margin with $p = 1$ and $F(u) = u$ we have the dual problem:

$$\max_{\tau} \; Q(\tau) = -\frac{1}{2} \sum_{i,j} \left( x_i^T x_j \right) \left( \tau_i^T \tau_j \right) + \frac{1}{C} \sum_{i=1}^{n} \tau_i^T \mathbf{1}_{y_i}$$

$$\text{u.c.} \quad \tau_i \leq \mathbf{1}_{y_i} \quad \text{and} \quad \tau_i^T \mathbf{1} = 0 \quad \forall i$$
We remark here again that we obtain a quadratic program which involves only the inner products between all couples of vectors $x_i, x_j$. Hence the generalization to the non-linear case is straightforward with the introduction of the kernel concept. The general problem is finally written by replacing the factor $\left( x_i^T x_j \right)$ by the kernel $K(x_i, x_j)$:

$$\max_{\tau} \; Q(\tau) = -\frac{1}{2} \sum_{i,j} K(x_i, x_j) \left( \tau_i^T \tau_j \right) + \frac{1}{C} \sum_{i=1}^{n} \tau_i^T \mathbf{1}_{y_i} \qquad (3.13)$$

$$\text{u.c.} \quad \tau_i \leq \mathbf{1}_{y_i} \quad \text{and} \quad \tau_i^T \mathbf{1} = 0 \quad \forall i \qquad (3.14)$$
The optimal solution of this problem allows us to evaluate the classification rule:

$$H(x) = \underset{r = 1 \ldots m}{\mathrm{arg\,max}} \left( \sum_{i=1}^{n} \tau_{i,r} \, K(x, x_i) \right) \qquad (3.15)$$

For a small number of classes $m$, we can implement the above optimization with a traditional QP program involving a matrix of size $mn \times mn$. However, for a large number of classes, we must employ an efficient algorithm, as even storing an $mn \times mn$ matrix is already a problem. Crammer and Singer introduced an interesting algorithm which handles this optimization problem efficiently, both in storage and in computation speed.
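Once the optimal dual variables are obtained, the recognition step (3.15) is straightforward. A minimal sketch is given below, where tau denotes the $n \times m$ matrix of optimal dual variables (an assumption on the storage format, for illustration only).

import numpy as np

def crammer_singer_decision(x, X, tau, kernel):
    # Classification rule (3.15): argmax_r sum_i tau[i, r] * K(x, x_i)
    k = np.array([kernel(x, xi) for xi in X])   # kernel values K(x, x_i)
    scores = k @ tau                            # one score per class r
    return int(np.argmax(scores))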
3.5 SVM-regression in finance

Recently, different applications of SVM in the financial field have been developed along two main directions. The first one employs SVM as a non-linear estimator in order to forecast the market tendency or the volatility. In this context, SVM is used as a regression technique with a feasible extension to the non-linear case thanks to the kernel approach. The second direction consists of using SVM as a classification technique, which aims to elaborate the stock selection of a trading strategy (for example a long/short strategy). The SVM regression can be considered as a non-linear filter for time series or as a regression for evaluating a score. We first discuss here how to employ the SVM regression as an estimator of the trend of a given asset. The estimated trend can be used later in momentum strategies such as the trend-following strategy. We then use SVM as a method for constructing stock scores for a long/short strategy.
3.5.1 Numerical tests on SVM-regressors

We test here the efficiency of the different regressors discussed above. They can be distinguished by the form of the loss function, $L_1$-type or $L_2$-type, or by the form of the non-linear kernel. We do not focus yet on the calibration of the SVM parameters and reserve it for the next discussion on trend extraction of financial time series, with a full description of the cross-validation procedure. For a given time series $y_t$, we would like to regress the data on the training vector $x = t = (t_i)_{i=1 \ldots n}$. Let us consider two models of time series. The first model is simply a deterministic trend perturbed by a white noise:

$$y_t = (t - a)^3 + \epsilon_t, \quad \epsilon_t \sim \mathcal{N}(0, 1) \qquad (3.16)$$
The second model for our tests is the Black-Scholes model of the stock price:

$$\frac{\mathrm{d}S_t}{S_t} = \mu_t \, \mathrm{d}t + \sigma_t \, \mathrm{d}B_t \qquad (3.17)$$

We notice here that the studied signal is $y_t = \ln S_t$. The parameters of the model are the annualized return $\mu = 5\%$ and the annualized volatility $\sigma = 20\%$. We consider the regression over a period of one year, corresponding to $N = 260$ trading days.
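A possible simulation of the two test models (3.16) and (3.17) is sketched below; the parameters follow the values quoted in the text, while the time grid and the noise amplitude of model (3.16) are illustrative assumptions.

import numpy as np

def simulate_cubic_trend(n=260, a=2.5, t_max=5.0, noise=1.0, seed=0):
    # model (3.16): y_t = (t - a)^3 + noise * N(0, 1)
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, t_max, n)
    return t, (t - a) ** 3 + noise * rng.standard_normal(n)

def simulate_black_scholes(n=260, mu=0.05, sigma=0.20, S0=100.0, seed=0):
    # model (3.17): exact log-increments of geometric Brownian motion, daily step
    rng = np.random.default_rng(seed)
    dt = 1.0 / 260.0
    increments = (mu - 0.5 * sigma ** 2) * dt + sigma * np.sqrt(dt) * rng.standard_normal(n)
    return np.exp(np.log(S0) + np.cumsum(increments))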
The first test consists of comparing the $L_1$-regressor and the $L_2$-regressor for the Gaussian kernel (see Figures 3.3-3.4). As shown in Figure 3.3 and Figure 3.4, the $L_2$-regressor seems to be more suitable for the regression. Indeed, through many tests on simulated data of Model (3.17), we observe that the $L_2$-regressor is more stable than the $L_1$-regressor (i.e. $L_1$ is more sensitive to the training dataset). In the second test, we compare different $L_2$ regressions corresponding to four typical kernels: 1. Linear, 2. Polynomial, 3. Gaussian, 4. Sigmoid.
Figure 3.3: $L_1$-regressor versus $L_2$-regressor with Gaussian kernel for model (3.16). [Plot of $y_t$ against $t$: real signal, $L_1$ regression and $L_2$ regression.]
3.5.2 SVM-Filtering for forecasting the trend of a signal

Here, we employ SVM as a non-linear filtering technique for extracting the hidden trend of a time series signal. The regression principle was explained in the last discussion.
Figure 3.4: $L_1$-regressor versus $L_2$-regressor with Gaussian kernel for model (3.17). [Plot of $\ln(S_t/S_0)$ against $t$: real signal, $L_1$ regression and $L_2$ regression.]
Figure 3.5: Comparison of different regression kernels for model (3.16). [Plot of $y_t$ against $t$: real signal, linear, polynomial, Gaussian and sigmoid kernels.]
Figure 3.6: Comparison of different regression kernels for model (3.17). [Plot of $y_t$ against $t$: real signal, linear, polynomial, Gaussian and sigmoid kernels.]
We now apply this technique to estimate the trend derivative $\hat{\mu}_t$ and plug it into a trend-following strategy.
Description of the trend-following strategy

We choose here the most simple trend-following strategy, whose exposure is given by:

$$e_t = m \, \frac{\hat{\mu}_t}{\hat{\sigma}_t^2}$$

with $m$ the risk tolerance and $\hat{\sigma}_t$ the estimator of volatility given by:

$$\hat{\sigma}_t^2 = \frac{1}{T} \int_0^T \sigma_t^2 \, \mathrm{d}t = \frac{1}{T} \sum_{i=t-T+1}^{t} \ln^2 \frac{S_i}{S_{i-1}}$$
In order to limit the risk of explosion of the exposure $e_t$, we cap it between a lower bound $e_{\min}$ and an upper bound $e_{\max}$:

$$e_t^* = \max\left( \min\left( m \, \frac{\hat{\mu}_t}{\hat{\sigma}_t^2}, \; e_{\max} \right), \; e_{\min} \right)$$
The wealth of the portfolio is then given by the following expression:

$$W_{t+1} = W_t + W_t \left[ e_t^* \left( \frac{S_{t+1}}{S_t} - 1 \right) + \left( 1 - e_t^* \right) r_t \right]$$
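A minimal sketch of this trading rule is given below: it computes the realized variance, the capped exposure and the daily wealth update exactly as in the formulas above. The risk tolerance m, the exposure bounds and the risk-free rate r_t are left as user inputs.

import numpy as np

def realized_variance(prices, T):
    # (1/T) * sum of squared log-returns over the last T days
    log_ret = np.diff(np.log(prices[-(T + 1):]))
    return np.mean(log_ret ** 2)

def exposure(mu_hat, sigma2_hat, m, e_min=-1.0, e_max=1.0):
    # capped exposure e_t^*
    return np.clip(m * mu_hat / sigma2_hat, e_min, e_max)

def wealth_update(W_t, e_t, S_t, S_next, r_t):
    # W_{t+1} = W_t * (1 + e_t*(S_{t+1}/S_t - 1) + (1 - e_t)*r_t)
    return W_t * (1.0 + e_t * (S_next / S_t - 1.0) + (1.0 - e_t) * r_t)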
SVM-Filtering

We now discuss how to build a cross-validation procedure which can help to learn the trend of a given signal. We employ the moving average as a benchmark against which to compare this new filter. An important parameter in moving-average filtering is the estimation horizon $T$, so we use this horizon as a reference to calibrate our SVM filtering. For the sake of simplicity, we study here only the SVM filter with Gaussian kernel and $L_2$ penalty. The two typical parameters of the SVM filter are $C$ and $\sigma$: $C$ is the parameter which allows a certain level of error in the regression curve, while $\sigma$ characterizes the horizon of estimation and is directly proportional to $T$. We propose two schemes for the validation procedure, based on the following division of the data: training set, validation set and testing set. In the first scheme, we fix the kernel parameter $\sigma = T$ and optimize the error tolerance parameter $C$ on the validation set. This scheme is comparable to our moving-average benchmark. The second scheme consists of optimizing both parameters $(C, \sigma)$ on the validation set. In this case, we let the validation data decide the estimation horizon. This scheme is more complicated to interpret, as $\sigma$ is now a dynamic parameter. However, by relating $\sigma$ to the local horizon, we can gain an additional understanding of the changes in the price of the underlying asset. For example, we can determine from the historical data whether the underlying asset undergoes a period with a long or a short trend. This can help to recognize additional signatures such as the cycle between long and short trends. We summarize the two schemes in Algorithm 3.
Figure 3.7: Cross-validation procedure for determining the optimal value $C^*$. [Time axis divided into a training window of length $T_1$, a validation window of length $T_2$ and a forecasting window of length $T_2$; historical data run up to today, the prediction lies beyond.]
Backtesting

We first check the SVM filter on simulated data given by the Black-Scholes model of the price. We consider a stock price with annualized return $\mu = 10\%$ and annualized volatility $\sigma = 20\%$. The regression is based on one trading year of data ($n = 260$ days) with a fixed horizon of one month, $T = 20$ days. In Figure 3.8, we present the result of the SVM trend prediction with fixed horizon $T = 20$, whereas Figure 3.9 presents the SVM trend prediction for the second scheme.
3.5.3 SVM for multivariate regression

As a regression method, we can employ SVM for multivariate regression. Assume that we consider a universe of $d$ stocks $X = \left( X^{(i)} \right)_{i=1 \ldots d}$ over a period of $n$ dates.
Figure 3.8: SVM-filtering with the fixed horizon scheme. [Plot of $y_t$ against $t$: real signal, training, validation and prediction segments.]
Figure 3.9: SVM-filtering with the dynamic horizon scheme. [Plot of $y_t$ against $t$: real signal, training, validation and prediction segments.]
Algorithm 3 SVM score construction
procedure SVM_Filter(X, y, T)
    Divide the data into a training set T_train, a validation set T_valid and a testing set T_test
    Perform the regression on the training data T_train
    Construct the SVM prediction on the validation set T_valid
    if fixed horizon then
        Set sigma = T
        Compute Error(C), the prediction error on T_valid
        Minimize Error(C) and obtain the optimal parameter C*
    else
        Compute Error(sigma, C), the prediction error on T_valid
        Minimize Error(sigma, C) and obtain the optimal parameters (sigma*, C*)
    end if
    Use the optimal parameters to predict the trend on the testing set T_test
end procedure
The performance of the index, or of an individual stock that we are interested in, is given by $y$. We look for a prediction of the value $y_{n+1}$ by using a regression on the historical data $(X_t, y_t)_{t=1 \ldots n}$. In this case, the different stocks play the role of the factors of the vectors in the training set. We can also apply other regressions, such as the prediction of the performance of a stock based on the available information on all the factors.
Multivariate regression

We first test the efficiency of the multivariate regression on a simulated model. Assume that all the factors follow a Brownian motion:

$$\mathrm{d}X_t^{(i)} = \mu_t \, \mathrm{d}t + \sigma_t \, \mathrm{d}B_t^{(i)}, \quad i = 1 \ldots d$$

Let $(y_t)_{1 \leq t \leq n}$ be the vector to be regressed, which is related to the input $X$ by the function:

$$y_t = f(X_t) = W_t^T X_t$$
We would like to regress the vector $y = (y_t)_{t=2 \ldots n}$ on the historical data $(X_t)_{t=1 \ldots n-1}$ by SVM regression. This regression is given by the function $y_t = F(X_{t-1})$. Hence, the prediction of the future performance $y_{n+1}$ is given by:

$$\mathbb{E}\left[ y_{n+1} \mid X_n \right] = F(X_n)$$

In Figure 3.10, we present the results obtained with the Gaussian kernel under the $L_1$ and $L_2$ penalty conditions, whereas in Figure 3.11 we compare the results obtained with different types of kernel. Here, we consider just a simple scheme with a lag of one trading day for the regression. In all figures, we notice this lag in the prediction of the value of $y$.
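For reference, one possible way to run this lagged multivariate regression with an off-the-shelf $\epsilon$-SVR (here the scikit-learn implementation, which is not the solver developed in this work) is sketched below; the hyperparameters are illustrative.

import numpy as np
from sklearn.svm import SVR

def predict_next(X, y, C=10.0, gamma=0.1, epsilon=0.01):
    # X: (n, d) matrix of factor values (one row per date), y: target series of length n
    model = SVR(kernel='rbf', C=C, gamma=gamma, epsilon=epsilon)
    model.fit(X[:-1], y[1:])            # regress y_t on the lagged factors X_{t-1}
    return model.predict(X[-1:])[0]     # estimate of y_{n+1} given X_n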
Figure 3.10: $L_1$-regressor versus $L_2$-regressor with Gaussian kernel for model (3.16). [Plot of $y_t$ against $t$: real signal, $L_1$ regression and $L_2$ regression.]
Figure 3.11: Comparison of different kernels for the multivariate regression. [Plot of $y_t$ against $t$: real signal, linear, polynomial, Gaussian and sigmoid kernels.]
Backtesting
3.6 SVM-classification in finance

In this section, we discuss the second application of SVM in finance, as a stock classifier. We first test our implementations of the binary classifier and of the multi-classifier. We then employ the SVM technique to study two different problems: (i) recognition of sectors and (ii) construction of an SVM score for a stock-picking strategy.

3.6.1 Test of SVM-classifiers

For the binary classification problem, we consider both approaches (dual/primal) to determine the boundary between two given classes based on the available information on each data point. For the multi-classification problem, we first extend the binary classifier to the multi-class case by using the binary decision tree (SVM-BDT). This algorithm has been demonstrated to be more efficient than the traditional approaches such as one-against-all or one-against-one, both in computation time and in precision. The general approach of multi-SVM will then be compared to SVM-BDT.
Binary-SVM classifier

Let us compare here the two proposed approaches (dual/primal) for solving the SVM-classification problem numerically. In order to carry out the test, we consider a random training data set of n vectors x_i with classification criterion y_i = sign(x_i). We present here the comparison of the two classification approaches with the linear kernel. The result of the primal approach is directly obtained with the software of O. Chapelle (see footnote 2). This software is implemented with the L2 penalty condition. Our dual solver is implemented for both L1 and L2 penalty conditions by simply employing the QP program. In Figure 3.12, we show the classification results obtained by both methods under the L2 penalty condition.

We next test the non-linear classification by using the Gaussian kernel (RBF kernel) with the binary dual solver. We generate the simulated data in the same way as in the previous example, with x in R^2. The result of the classification is illustrated in Figure 3.13 for the RBF kernel with parameters C = 0.5 and σ = 2 (see footnote 3).
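A minimal sketch of this non-linear test, assuming scikit-learn's SVC is used in place of our own dual solver; the mapping gamma = 1/(2·σ²) converts the kernel width σ = 2 into scikit-learn's parameterization, and the labelling rule on the first coordinate is an illustrative choice.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 2))
y = np.sign(X[:, 0])                                        # classification criterion

clf = SVC(kernel="rbf", C=0.5, gamma=1.0 / (2 * 2.0**2))    # RBF kernel with sigma = 2
clf.fit(X, y)
print("in-sample accuracy:", clf.score(X, y))
```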
Multi-SVM classifier

We first test the implementation of SVM-BDT on simulated data (x_i)_{i=1...n} generated randomly. We suppose that these data are distributed in N_c classes. In order to test our multi-SVM implementation efficiently, the response vector y = (y_1, ..., y_n) is assumed to depend only on the first coordinate of the data vector, as specified below.

² The free software of O. Chapelle can be found at the following website: http://olivier.chapelle.cc/primal/
³ We used here the plotlssvm function of the LS-SVM toolbox for graphical illustration. A similar result was also obtained using the trainlssvm function of the same toolbox.
Figure 3.12: Comparison between the dual algorithm and the primal algorithm (training data with the primal and dual solutions, decision boundary and margins)
Figure 3.13: Illustration of non-linear classification with Gaussian kernel (class 1 and class 2 in the (X1, X2) plane with the non-linear decision boundary)
More precisely, the simulated data follow the model:

$$z \sim \mathcal{U}(0,1), \qquad x_1 = N_c\, z, \qquad y = \left[x_1 + \sigma\,\mathcal{N}(0,1)\right], \qquad x_i \sim \mathcal{U}(0,1) \;\; \text{for } i > 1$$
Here [a] denotes the integer part of a. We could generate our simulated data in a much more general way, but it would then be very hard to visualize the result of the classification. With the above choice of simulated data, we can see that in the case σ = 0 the data are separable along the axis x_1. From a geometric point of view, the space R^d is divided into N_c zones along the axis x_1: R^{d-1} × [0, 1[, ..., R^{d-1} × [N_c, N_c + 1[. The boundaries are simply the N_c hyperplanes R^{d-1} crossing x_1 = 1, ..., N_c. When we introduce some noise on the coordinate x_1 (σ > 0), the training set is no longer separable by this ensemble of linear hyperplanes: there are some misclassified points and some deformation of the boundaries thanks to the non-linear kernel. For the sake of simplicity, we assume that the data (x, y) are already gathered by group. In Figures 3.14 and 3.15, we present the classification results for in-sample data and out-of-sample data in the case σ = 0 (i.e. separable data). We then introduce noise in the data coordinate x_1 with σ = 0.2 (Figures 3.16 and 3.17).
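The following sketch generates the simulated multi-class data under my reading of the model above (the exact placement of the noise term is garbled in the source, so the line computing y is an assumption):

```python
import numpy as np

def simulate_classes(n=500, d=2, n_classes=10, sigma=0.0, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.uniform(0.0, 1.0, size=n)
    X = rng.uniform(0.0, 1.0, size=(n, d))
    X[:, 0] = n_classes * z                                   # informative coordinate x_1
    y = np.floor(X[:, 0] + sigma * rng.standard_normal(n))    # class label = [x_1 + noise]
    return X, np.clip(y, 0, n_classes - 1).astype(int)

X0, y0 = simulate_classes(sigma=0.0)     # separable along x_1
X1, y1 = simulate_classes(sigma=0.2)     # noisy case
```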
Figure 3.14: Illustration of multiclassification with SVM-BDT for in-sample data (real sector distribution versus multiclass SVM, stocks S10-S99, classes C01-C10)
Figure 3.15: Illustration of multiclassification with SVM-BDT for out-of-sample data (real sector distribution versus multiclass SVM, stocks S05-S50, classes C01-C10)
Figure 3.16: Illustration of multiclassification with SVM-BDT for σ = 0 (classes C1-C10 in the (x_1, x_2) plane)
Figure 3.17: Illustration of multiclassification with SVM-BDT for σ = 0.2 (classes C1-C10 in the (x_1, x_2) plane)
3.6.2 SVM for classification

We employ here the multi-SVM algorithm on all the constituents of the Eurostoxx 300 index. Our goal is to determine the boundaries between the various sectors to which the constituents of the index belong. As the algorithm contains two main parts, classification and prediction, we can classify our stocks via their common properties derived from the available factors. The number of misclassified stocks, or the classification error, gives us an estimation of the quality of the sector definition. We then study the recognition phase on the ensemble of test data.
Classification of stocks by sectors

In order to classify the stocks composing the Eurostoxx 300 index, we consider the N_train = 100 most representative stocks in terms of market value. In order to establish the multiclass SVM classification using the binary decision tree, we sort the N_train = 100 assets by sector. We then employ the SVM-BDT to compute the N_train - 1 binary separators. In Figure 3.18, we present the classification result with the Gaussian kernel and the L2 penalty condition. For σ = 2 and C = 20, we are able to correctly classify the 100 assets over the ten main sectors: Oil & Gas, Industrials, Financials, Telecommunications, Health Care, Basic Materials, Consumer Goods, Technology, Utilities, Consumer Services.

In order to check the efficiency of the classification, we test the prediction quality on a test set composed of N_test = 50 assets. In Figure 3.19, we compare the SVM-BDT result with the true sector distribution of the 50 assets.
Figure 3.18: Multiclassification with SVM-BDT on the training set (real sector distribution versus multiclass SVM for stocks S1-S100 across the ten sectors)
In this case, the rate of correct prediction is about 58%.
Calibration procedure

As discussed above in the implementation part of the SVM solver, there are two kinds of parameters which play an important role in the classification process. The first parameter C controls the tolerance for margin errors, while the second set of parameters concerns the choice of kernel (σ for the Gaussian kernel, for example). In the last example, we optimized the pair of parameters (C, σ) in order to obtain the best classifiers, which do not commit any error on the training set. However, this result holds only if the sectors are correctly defined, and nothing guarantees that the given notion of sectors is the most appropriate one. Hence, the classification process should consist of two steps: (i) determine the binary SVM classifiers on the training data set and (ii) calibrate the parameters on the validation set. In fact, we decide to optimize the pair of parameters (C, σ) by minimizing the realized error on the validation set, because the error committed on the training set (learning set) must always be smaller than the one on the validation set (unknown set). In the second phase, we can redefine the sectors in the sense that if an asset is misclassified, we change its sector label and repeat the optimization on the validation set until convergence. At the end of the calibration procedure, we expect to obtain first a new recognition of sectors and second a multi-classifier for new assets.

As SVM uses the training set to learn the classification, it must commit fewer errors on this set than on the validation set.
Figure 3.19: Prediction efficiency with SVM-BDT on the validation set (real sector distribution versus multiclass SVM for stocks S101-S150 across the ten sectors)
We propose here to optimize the SVM parameters by minimizing the error on the validation set. We use the same error function defined in Section 3, but apply it to the validation data set V:

$$\text{Error} = \frac{1}{\operatorname{card}(\mathcal{V})} \sum_{i \in \mathcal{V}} \theta\left(-y_i\, f(x_i)\right)$$

where θ(x) = I_{x>0}, with I_A the standard notation for the indicator function. However, the error function involves the step function, which is discontinuous; this can cause some difficulty if we want to determine the best selection of parameters via the optimal test error. In order to perform the search for the minimal test error by gradient descent, for example, we should smooth the test error by replacing the step function with:

$$\tilde{\theta}(x) = \frac{1}{1 + \exp(-Ax + B)}$$

The choice of the parameters A and B is important: if A is too small, the approximation error is too large, whereas if A is too large, the test error is not smooth enough for the minimization procedure.
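A short sketch of the two error functions (hard and smoothed) follows, under one consistent sign convention for the logistic approximation since the signs are not fully legible in the source:

```python
import numpy as np

def validation_error(y_val, f_val):
    # fraction of validation points with y_i * f(x_i) <= 0 (misclassified)
    return np.mean(y_val * f_val <= 0)

def smoothed_validation_error(y_val, f_val, A=10.0, B=0.0):
    # logistic surrogate of the step function; larger A gives a sharper approximation
    return np.mean(1.0 / (1.0 + np.exp(A * y_val * f_val + B)))
```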
Recognition of sectors

By construction, the SVM classifier is a very efficient method for recognizing and classifying a new element with respect to a given number of classes. However, it is not able by itself to recognize the sectors or to introduce a new, correct definition of the available sectors over a universe of available data (stocks). In finance, the classification by sector is more related to the origin of the stock than to the intrinsic behaviour of the stock in the market. This may cause problems for a trading strategy if a stock is misclassified, for example in the case of a pair-trading strategy. Here, we try to overcome this weak point of SVM in order to introduce a method which modifies the initial definition of the sectors.

The main idea of the sector recognition procedure is the following. We divide the available data into two sets: a training set and a validation set. We employ the training set to learn the classification and the validation set to optimize the SVM parameters. We start with the initial definition of the given sectors. Within each iteration, we learn on the training set in order to determine the classifiers, then we compute the validation error. An optimization procedure on the validation error helps us to determine the optimal parameters of the SVM. For each ensemble of optimal parameters, we may observe some errors on the training set. If the validation error is smaller than a certain threshold with no error on the training set, we have reached the optimal configuration of the sector definition. If there are errors on the training set, we relabel the misclassified data points and define new sectors with this correction. All the sector labels are changed by this rule for both the training and validation sets. The iteration procedure is repeated until no error is committed on the training set for a given expected threshold of error on the validation set. The algorithm of this sector recognition procedure is summarized in the following table:
Algorithm 4 Sector recognition by SVM classification
procedure SVM_SectorRecognition(X, y, ε)
    Divide the historical data into a training set T and a validation set V
    Initialize the sector labels with the physical sector names: Sec⁰₁, ..., Sec⁰ₘ
    while E_T > ε do
        while E_V > ε do
            Compute the SVM separators for the labels Sec₁, ..., Secₘ on T for given (C, σ)
            Construct the SVM predictor from the separators Sec₁, ..., Secₘ
            Compute the error E_V on the validation set
            Update the parameters (C, σ) until convergence of E_V
        end while
        Compute the error E_T on the training set
        Identify the misclassified points of the training set
        Relabel the misclassified points, then update the definition of the sectors
    end while
end procedure
3.6.3 SVM for score construction and stock selection

Traditionally, in order to improve stock picking, we rank the stocks by constructing a score based on all the characteristics (so-called factors) of the considered stock. We require that the construction of this global quantity (a combination of factors) satisfies some classification criterion, for example the performance. We denote by (x_i)_{i=1...n} the ensembles of factors, with x_i the factor vector of the i-th stock. The classification criterion, such as the performance, is denoted by the vector y = (y_i)_{i=1...n}. The aim of the SVM classifier in this problem is to recognize which stocks (scores) belong to the high or low performance class (outperforming or underperforming). More precisely, we have to identify a separation boundary as a function of score and performance f(x, y). Hence, SVM stock picking consists of two steps: (i) construction of the factor ensemble (i.e. harmonizing all the characteristics of a given stock, such as the price, the risk, macro-properties, etc., into comparable quantities); (ii) application of the SVM classifier algorithm with an adaptive choice of parameters. In the following, we first give a brief description of the score constructions and then establish the backtest of a stock-picking strategy.
Probit model for score construction

We briefly summarize here the main idea of the score construction by the Probit model. Assume that a set of training data (x_i, y_i)_{i=1...n} is available, where x is the vector of factors and y is the binary response. We look for a conditional probability distribution of the random variable Y given a point X. This probability distribution can be used later to predict the response of a new data point x_new. The Probit model estimates this conditional probability in the form:

$$\Pr(Y = 1 \mid X) = \Phi\left(X^{\top}\beta + \alpha\right)$$

with Φ(x) the cumulative distribution function (CDF) of the standard normal distribution. The pair of parameters (β, α) can be obtained by maximum likelihood estimation. The choice of the function Φ(x) is quite natural when we work with a binary random variable, because it yields a symmetric probability distribution.

Remark 9 This model can be written in another form with the introduction of a hidden random variable:

$$Y^{\star} = X^{\top}\beta + \alpha + \epsilon$$

where ε ∼ N(0, 1). Hence, Y can be interpreted as an indicator of whether Y* is positive:

$$Y = \mathbb{1}_{\{Y^{\star} > 0\}} = \begin{cases} 1 & \text{if } Y^{\star} > 0 \\ 0 & \text{otherwise} \end{cases}$$

In finance, we can employ this model for the score construction. We define the binary variable Y from the relative return of a given asset with respect to the benchmark: Y = 1 if the return of the asset is higher than that of the benchmark and Y = 0 otherwise. Hence, Pr(Y = 1 | X) is the probability for the given asset with factor vector X to outperform. Naturally, we can define this quantity as a score measuring the probability of gain over the benchmark:

$$S = \Pr(Y = 1 \mid X)$$
In order to estimate the regression parameters β and α, we maximize the log-likelihood function:

$$\mathcal{L}(\beta, \alpha) = \sum_{i=1}^{n} y_i \ln \Phi\left(x_i^{\top}\beta + \alpha\right) + (1 - y_i) \ln\left(1 - \Phi\left(x_i^{\top}\beta + \alpha\right)\right)$$

Using the parameters estimated by maximum likelihood, we can predict the score of a given asset with factor vector X as follows:

$$\hat{S} = \Phi\left(X^{\top}\hat{\beta} + \hat{\alpha}\right)$$

The probability distribution of the score Ŝ can be computed by the empirical formula:

$$\Pr\left(\hat{S} < s\right) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{\{S_i < s\}}$$

However, if we compute the probability density function (PDF) from this empirical distribution, we obtain a sum of Dirac functions. In order to obtain a smooth distribution, we convolve this density with a Gaussian kernel; the PDF then reads:

$$p_S(s) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(s - s_i)^2 / 2\sigma^2}$$

where σ is a smoothing parameter.
In order to test the numerical implementation, we apply the Probit model to simulated data generated in the same way as the hidden-variable model discussed in the remark. Let (x_1, ..., x_n) ∼ N(0, σ²) be the data set with d factors (i.e. x_i ∈ R^d). For all simulations, we took σ = 0.1. The binary response is given by the following model:

$$Y^{0} = X^{\top}\beta^{0} + \alpha^{0} + \mathcal{N}(0, 1), \qquad Y = \mathbb{1}_{\{Y^{0} > 0\}}$$

Here, the parameters of the model are chosen as β⁰ = 0.1 and α⁰ = 1. We employ the Probit regression in order to determine the score of n = 500 data points in the cases d = 2 and d = 5. The comparisons between the Probit score and the simulated score are presented in Figures 3.20-3.22.
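A minimal sketch of the Probit estimation on these simulated data, using scipy's optimizer for the maximum likelihood step (the thesis does not specify the optimizer):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
n, d = 500, 2
X = 0.1 * rng.standard_normal((n, d))                  # x_i ~ N(0, sigma^2), sigma = 0.1
beta0, alpha0 = np.full(d, 0.1), 1.0
y = (X @ beta0 + alpha0 + rng.standard_normal(n) > 0).astype(float)

def neg_log_likelihood(params):
    p = norm.cdf(X @ params[:d] + params[d])
    p = np.clip(p, 1e-12, 1 - 1e-12)                   # numerical safety
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

res = minimize(neg_log_likelihood, x0=np.zeros(d + 1), method="BFGS")
score = norm.cdf(X @ res.x[:d] + res.x[d])             # Probit score S = Phi(X^T beta + alpha)
```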
SVM score construction

We now discuss how to employ SVM to construct a score for a given ensemble of assets. In the work of G. Simon (2005), the SVM score is constructed by using the SVM-regression algorithm. Indeed, with the SVM-regression algorithm we are able to forecast the future performance E[R_{t+1} | X_t] based on the present ensemble of factors; this value can then be employed directly as the prediction in a trend-following strategy, without the need for a score construction.
Figure 3.20: Comparison between simulated score and Probit score for d = 2 (scores of the 500 assets)
Figure 3.21: Comparison between simulated score CDF and Probit score CDF for d = 2
Figure 3.22: Comparison between simulated score PDF and Probit score PDF for d = 2
We propose here another use of the SVM algorithm, based on SVM-classification, for building scores which later allow long/short strategies to be implemented by using selection curves. Our main idea for the SVM-score construction is very similar to the Probit model. We first define a binary variable Y_i = ±1 associated with each asset x_i. This variable characterizes the performance of the asset with respect to the benchmark: if Y_i = -1 the stock underperforms, whereas if Y_i = +1 the stock outperforms. We then employ the binary SVM-classification to separate the universe of stocks into two classes: high performance and low performance. Finally, we define the score of each stock as its distance to the decision boundary.
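A minimal sketch of this construction, assuming a scikit-learn classifier whose decision_function provides the signed distance to the boundary; the rank-based rescaling to [0, 1] is an illustrative choice, not part of the thesis:

```python
import numpy as np
from sklearn.svm import SVC

def svm_scores(X_factors, y_perf, C=1.0, gamma=0.5):
    # y_perf: +1 if the stock outperforms the benchmark, -1 otherwise
    clf = SVC(kernel="rbf", C=C, gamma=gamma).fit(X_factors, y_perf)
    raw = clf.decision_function(X_factors)          # signed distance to the decision boundary
    ranks = raw.argsort().argsort()
    return raw, ranks / max(len(raw) - 1.0, 1.0)    # raw score and a [0, 1] rescaling
```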
Selection curve

In order to construct a simple strategy of long/short type, for example, we must be able to establish a selection rule based on the score obtained by the Probit model or the SVM regression. Depending on the strategy (long, short or long/short), we want to build a selection curve which determines the proportion of selected assets corresponding to a certain level of error. For a long strategy, we prefer to buy a certain proportion of high-performance stocks with knowledge of the possible committed error. To do so, we define a selection curve for which the score plays the role of the parameter:

$$Q(s) = \Pr(S \geq s), \qquad E(s) = \Pr(S \geq s \mid Y = 0), \qquad s \in [0, 1]$$

This parametric curve can be traced in the square [0, 1] × [0, 1], as shown in Figure 3.23. On the x-axis, Q(s) defines the quantile corresponding to the stock selection among the considered universe of stocks. On the y-axis, E(s) defines the error committed for this stock selection; precisely, for a certain quantile, it measures the chance of picking a low-performance stock. Two trivial limits are the points (0, 0) and (1, 1): the first corresponds to the limit with no selection, whereas the second corresponds to the limit where all stocks are selected. A good score construction method should produce a selection curve that is as convex as possible, because this guarantees a selection with fewer errors.
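The selection curve itself only requires the scores and the binary performance labels; a small sketch for the long-strategy case:

```python
import numpy as np

def selection_curve(scores, y):
    # long-strategy curve: Q(s) = Pr(S >= s), E(s) = Pr(S >= s | Y = 0)
    thresholds = np.sort(np.unique(scores))[::-1]
    n_bad = max((y == 0).sum(), 1)
    Q = np.array([(scores >= s).mean() for s in thresholds])
    E = np.array([((scores >= s) & (y == 0)).sum() / n_bad for s in thresholds])
    return Q, E
```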
Figure 3.23: Selection curve for a long strategy for simulated data and the Probit model (x-axis: Pr(S > s); y-axis: Pr(S > s | Y = 0))
Reciprocally, for a short strategy, the selection curve is obtained by tracing the following parametric curve:

$$Q(s) = \Pr(S \leq s), \qquad E(s) = \Pr(S \leq s \mid Y = 1), \qquad s \in [0, 1]$$

Here, Q(s) determines the quantile of low-performance stocks to be shorted, while E(s) helps us to avoid selling the high-performance ones.
Figure 3.24: Probit scores for Eurostoxx data with d = 20 factors (selection curves on the training and validation sets; x-axis: Pr(S > s), y-axis: Pr(S > s | Y = 0))
As the selection curve is independent of the score definition, it is an appropriate quantity for comparing different scoring techniques. In the following, we employ the selection curve to compare the score constructions of the Probit model and of the SVM regression. Figure 3.24 shows the comparison of the selection curves constructed from the SVM score and the Probit score on the training set. Here, we did not perform any calibration of the SVM parameters.
Backtesting and comparison

As presented in the previous discussion on the regression, we have to build a cross-validation procedure to optimize the SVM parameters. We follow the traditional routine by dividing the data into three independent sets: (i) a training set, (ii) a validation set and (iii) a testing set. The classifier is obtained from the training set, whereas its optimal parameters (C, σ) are obtained by minimizing the fitting error on the validation set. The efficiency of the SVM algorithm is finally checked on the testing set. We summarize the cross-validation procedure in the algorithm below. In order to make the training set close to both the validation data and the testing data, we decided to divide the data in the following time order: validation set, training set and testing set. In this way, the prediction score on the testing set contains more information from the recent past.

We now employ this procedure to compute the SVM score on the universe of stocks of the Eurostoxx index. Figure 3.25 presents the construction of the score based on the training set and the validation set.
Algorithm 5 SVM score construction
procedure SVM_Score(X, y)
    Divide the data into a training set T_train, a validation set T_valid and a testing set T_test
    Classify the training data by using the high/low performance criterion
    Compute the decision boundary on T_train
    Construct the SVM score on T_valid by using the distance to the decision boundary
    Compute Error(σ, C): prediction error and classification error on T_valid
    Minimize Error(σ, C) and obtain the optimal parameters (σ*, C*)
    Use the optimal parameters to compute the final SVM score on the testing set T_test
end procedure
The SVM parameters are optimized on the validation set, while the final score construction uses both the training and the validation sets in order to have the largest possible data ensemble.
Figure 3.25: SVM scores for Eurostoxx data with d = 20 factors (selection curves on the training, validation and testing sets; x-axis: Pr(S > s), y-axis: Pr(S > s | Y = 0))
3.7 Conclusion

Support vector machines are a well-established method with very wide use in various domains. From a financial point of view, this method can be used to recognize and to predict high-performance stocks. Hence, SVM is a good indicator for building efficient trading strategies over a universe of stocks. Within this chapter, we first revisited the basic ideas of SVM in both the classification and regression contexts. The extension to the multi-classification case was also discussed in detail, as were various applications of this technique. The first class of applications employs SVM as a forecasting method for time series. We proposed two applications: the first one consists of using SVM as a signal filter; the advantage of the method is that we can calibrate the model parameters by using only the available data. The second application employs SVM as a multi-factor regression technique, which allows the prediction to be refined with additional inputs such as economic factors. For the second class of applications, we deal with SVM classification. The two main applications discussed in the scope of this chapter are the score construction and the sector recognition. Both resulting pieces of information are important for building momentum strategies, which are at the core of modern asset management.
Bibliography

[1] Allwein E.L. et al. (2000), Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers, Journal of Machine Learning Research, 1, pp. 113-141.
[2] At A. (2005), Optimisation d'un Score de Stock Screening, Rapport de stage-ENSAE, Société Générale Asset Management.
[3] Basak D., Pal S. and Patranabis D.J. (2007), Support Vector Regression, Neural Information Processing, 11, pp. 203-224.
[4] Ben-Hur A. and Weston J. (2010), A User's Guide to Support Vector Machines, Methods in Molecular Biology, 609, pp. 223-239.
[5] Burges C.J.C. (1998), A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 2, pp. 121-167.
[6] Chapelle O. (2002), Support Vector Machines: Induction Principles, Adaptive Tuning and Prior Knowledge, PhD thesis, Paris 6.
[7] Chapelle O. et al. (2002), Choosing Multiple Parameters for Support Vector Machines, Machine Learning, 46, pp. 131-159.
[8] Chapelle O. (2007), Training a Support Vector Machine in the Primal, Neural Computation, 19, pp. 1155-1178.
[9] Cortes C. and Vapnik V. (1995), Support-Vector Networks, Machine Learning, 20, pp. 273-297.
[10] Crammer K. and Singer Y. (2001), On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines, Journal of Machine Learning Research, 2, pp. 265-292.
[11] Gestel T.V. et al. (2001), Financial Time Series Prediction Using Least Squares Support Vector Machines Within the Evidence Framework, IEEE Transactions on Neural Networks, 12, pp. 809-820.
[12] Madzarov G. et al. (2009), A Multi-class SVM Classifier Utilizing Binary Decision Tree, Informatica, 33, pp. 233-241.
[13] Milgram J. et al. (2006), "One Against One" or "One Against All": Which One is Better for Handwriting Recognition with SVMs?, Tenth International Workshop on Frontiers in Handwriting Recognition.
[14] Potluru V.K. et al. (2009), Efficient Multiplicative Updates for Support Vector Machines, Proceedings of the 2009 SIAM Conference on Data Mining.
[15] Simon G. (2005), L'Économétrie Non Linéaire en Gestion Alternative, Rapport de stage-ENSAE, Société Générale Asset Management.
[16] Tay F.E.H. and Cao L.J. (2002), Modified Support Vector Machines in Financial Time Series Forecasting, Neurocomputing, 48, pp. 847-861.
[17] Tsochantaridis I. et al. (2004), Support Vector Machine Learning for Interdependent and Structured Output Spaces, Proceedings of the 21st International Conference on Machine Learning, Banff, Canada.
[18] Vapnik V. (1998), Statistical Learning Theory, John Wiley and Sons, New York.
Chapter 4

Analysis of Trading Impact in the CTA Strategy

We review in this chapter the trend-following strategies within the Kalman filter framework and study the impact of the trend estimation error. We first study the case of a momentum strategy in the single-asset case, then generalize the analysis to the multi-asset case. In order to construct the allocation strategy, we employ the observed trend, which is filtered by an exponential moving average. It can be shown that the cumulative return of the strategy can be broken down into two important parts: the option profile, which is similar in concept to the straddle profile suggested by Fung and Hsieh (2001), and the trading impact, which directly involves the effect of the estimation error on the efficiency of the strategy. We focus here on the second quantity by estimating its probability distribution function and the associated gain and loss expectations. We illustrate how the number of assets and their correlations influence the performance of a strategy via a toy model. This study can reveal important results which can be directly tested on a CTA fund such as the Epsilon fund.

Keywords: CTA, Momentum strategy, Trend following, Kalman filter, Trading impact, Chi-square distribution.

4.1 Introduction

Trend-following strategies are a specific example of an investment style that has recently emerged as an industry. They are implemented by so-called Commodity Trading Advisors (CTAs) and play an important role in the hedge fund industry (about 15% of total hedge fund AUM). Recently, this investment style has been carefully reviewed and analyzed in the 7th White Paper of Lyxor. We present here a complementary result to that paper and give a more specific analysis of a typical CTA. We focus on the trading impact by estimating its probability distribution function and the associated gain and loss expectations. We illustrate how the number of assets and their correlations influence the performance of a strategy via a toy model. This study can reveal important results which can be directly tested on a CTA fund such as the Epsilon fund.

This chapter is organized as follows. In the first part, we recall the main results on the trend-following strategy in the univariate case, which were derived in the 7th White Paper of Lyxor. We then generalize these results to the multivariate case, which establishes a framework for studying the impact of the correlation and of the number of assets in a CTA fund. Finally, we finish with the study of a toy model which allows us to understand the efficiency of a trend-following strategy.
4.2 Conclusion

Momentum strategies are efficient ways to use the market tendency for building trading strategies. Hence, a good estimator of the trend is essential from this perspective. In this chapter, we studied the impact of the estimation error on a trend-following strategy, both in the single-asset case and in the multi-asset case. The objective is twofold. First, we have established a general framework for analyzing a CTA fund. Second, we have illustrated important results on the trading impact of a CTA strategy via a simple toy model. We have shown that the gain probability and the gain expectation depend strongly on the correlation and on the number of assets. Increasing the number of assets can help to improve the performance and reduce the risk (volatility) within a momentum strategy. However, when the number of assets reaches a certain limit, we observe a saturation of the performance. It implies that above this limit, adding more assets does not improve the performance very much, but it does make the strategy more complicated and increases the management cost, as the portfolio is rebalanced frequently. The correlation between assets plays an important role as well: as usual, the higher the correlation level, the less efficient the strategies. Interestingly, we remark that when the correlation increases, we approach the single-asset limit, in which the gain probability is smaller than the loss probability but the conditional expectation of gains is much higher than the conditional expectation of losses.
Conclusions

During the internship in the R&D team of Lyxor Asset Management, I had the chance to work on many interesting topics concerning quantitative asset management. Beyond this report, the results obtained during the stay have been used for the 8th edition of the Lyxor White Paper series. The main results of this internship can be divided into three broad lines. The first consists of improving the trend and volatility estimations, which are important quantities for implementing dynamical strategies. The second concerns the application of machine learning technology in finance: we employ the support vector machine to forecast the expected return of financial assets and to obtain a criterion for stock selection. The third is devoted to the analysis of the performance of the trend-following strategy (CTA) in the general case; it consists of studying the efficiency of CTAs under changes in the market, such as the correlation between the assets or their performance.

In the first part, we focused on improving the trend and volatility estimations in order to implement two crucial momentum strategies: trend-following and vol-target. We show that we can use L1 filters to forecast the trend of the market in a very simple way. We also propose a cross-validation procedure to calibrate the optimal regularization parameter, where the only information to provide is the investment time horizon. More sophisticated models based on local and global trends are also discussed; we remark that these models can reflect the effect of mean reversion around the global trend of the market. Finally, we consider several backtests on the S&P 500 index and obtain results that compete with the traditional moving-average filter. On the other hand, vol-target strategies are efficient ways to control the risk of trading strategies, hence a good estimator of the volatility is essential from this perspective. In this report, we present improvements in the forecasting of volatility by using some novel techniques. The use of high and low prices is less important for the index, as it gives more or less the same result as the traditional close-to-close estimator. However, for individual stocks with a higher volatility level, the high-low estimators improve the prediction of volatility. We consider several backtests on the S&P 500 index and obtain results that compete with the traditional moving-average estimator of volatility. Indeed, we consider a simple stochastic volatility model which permits us to integrate the dynamics of the volatility into the estimator. An optimization scheme via the maximum likelihood algorithm allows us to obtain the optimal averaging window dynamically. We also compare these results for the range-based estimator with the well-known IGARCH model. The comparison between the optimal values of the likelihood functions for the various estimators also gives us a ranking of the estimation errors. Finally, we studied the high-frequency volatility estimator, which is a very active topic in financial mathematics. Using the simple model proposed by Zhang et al. (2005), we show that the microstructure noise can be eliminated by the two-time-scale estimator.

Support vector machines are a well-established method with very wide use in various domains. From a financial point of view, this method can be used to recognize and to predict high-performance stocks; SVM is thus a good indicator for building efficient trading strategies over a stock universe. Within the second part of this report, we first revisited the basic ideas of SVM in both the classification and regression contexts. The extension to the multi-classification case is also discussed in detail, as are various applications of this technique. The first class of applications employs SVM as a forecasting method for time series. We proposed two applications: the first one consists of using SVM as a signal filter; the advantage of the method is that we can calibrate the model parameters by using only the available data. The second application employs SVM as a multi-factor regression technique, which allows the prediction to be refined with additional inputs such as economic factors. For the second class of applications, we deal with SVM classification. The two main applications discussed in the scope of this report are the score construction and the sector recognition. Both resulting pieces of information are important for building momentum strategies, which play an important role in Lyxor's quantitative management.

Finally, we carried out a detailed analysis of the performance of the trend-following strategy in order to understand its important role in risk diversification and in optimizing the absolute return. In the third part, we studied the impact of the estimation error and of market parameters, such as the correlation and the average performance of individual stocks, on a trend-following strategy, both in the single-asset and multi-asset cases. The objective of this chapter is twofold. First, we have established a general framework for analyzing a CTA fund. Second, we have illustrated important results on the trading impact of a CTA strategy via a simple toy model. We have shown that the gain probability and the gain expectation depend strongly on the correlation and on the number of assets. Increasing the number of assets can help to improve the performance and reduce the risk (volatility) within a momentum strategy. However, when the number of assets reaches a certain limit, we observe a saturation of the performance. It implies that above this limit, adding more assets does not improve the performance very much, but it does make the strategy more complicated and increases the management cost, as the portfolio is rebalanced frequently. The correlation between assets plays an important role as well: as usual, the higher the correlation level, the less efficient the strategies. Interestingly, we remark that when the correlation increases, we approach the single-asset limit, in which the gain probability is smaller than the loss probability but the conditional expectation of gains is much higher than the conditional expectation of losses.
Appendix A

Appendix of chapter 1

A.1 Computational aspects of the L1 and L2 filters

A.1.1 The dual problem

The L1-T filter

This problem can be solved by considering the dual problem, which is a QP program. We first rewrite the primal problem with the new variable z = Dx:

$$\min \; \frac{1}{2}\|y - x\|_2^2 + \lambda \|z\|_1 \qquad \text{u.c.} \quad z = Dx$$

We now construct the Lagrangian function with the dual variable ν ∈ R^{n-2}:

$$\mathcal{L}(x, z, \nu) = \frac{1}{2}\|y - x\|_2^2 + \lambda\|z\|_1 + \nu^{\top}(Dx - z)$$

The dual objective function is obtained in the following way:

$$\inf_{x,z} \mathcal{L}(x, z, \nu) = -\frac{1}{2}\nu^{\top}DD^{\top}\nu + y^{\top}D^{\top}\nu$$

for -λ1 ≤ ν ≤ λ1. According to the Kuhn-Tucker theorem, the initial problem is equivalent to the dual problem:

$$\min \; \frac{1}{2}\nu^{\top}DD^{\top}\nu - y^{\top}D^{\top}\nu \qquad \text{u.c.} \quad -\lambda\mathbf{1} \leq \nu \leq \lambda\mathbf{1}$$

This QP program can be solved by a traditional Newton algorithm or by interior-point methods, and the final solution for the trend reads:

$$x^{\star} = y - D^{\top}\nu^{\star}$$
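As a cross-check of this dual derivation, the primal problem can also be solved directly with a generic convex solver; the following sketch assumes the cvxpy package is available and is not the interior-point implementation discussed below.

```python
import numpy as np
import cvxpy as cp

def l1_trend_filter(y, lam):
    n = len(y)
    D = np.diff(np.eye(n), n=2, axis=0)       # (n-2) x n second-difference operator
    x = cp.Variable(n)
    objective = cp.Minimize(0.5 * cp.sum_squares(y - x) + lam * cp.norm1(D @ x))
    cp.Problem(objective).solve()
    return x.value
```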
The L1-C filter

The optimization procedure for the L1-C filter follows the same strategy as for the L1-T filter. We obtain the same quadratic program with the operator D replaced by the (n-1) × n matrix which is the discrete version of the first-order derivative:

$$D = \begin{pmatrix} 1 & -1 & & & \\ & 1 & -1 & & \\ & & \ddots & \ddots & \\ & & & 1 & -1 \end{pmatrix}$$
The L1-TC filter

In order to follow the same strategy presented above, we introduce two additional variables z_1 = D_1 x and z_2 = D_2 x. The initial problem becomes:

$$\min \; \frac{1}{2}\|y - x\|_2^2 + \lambda_1\|z_1\|_1 + \lambda_2\|z_2\|_1 \qquad \text{u.c.} \quad z_1 = D_1 x, \;\; z_2 = D_2 x$$

The Lagrangian function with the dual variables ν_1 ∈ R^{n-1} and ν_2 ∈ R^{n-2} is:

$$\mathcal{L}(x, z_1, z_2, \nu_1, \nu_2) = \frac{1}{2}\|y - x\|_2^2 + \lambda_1\|z_1\|_1 + \lambda_2\|z_2\|_1 + \nu_1^{\top}(D_1 x - z_1) + \nu_2^{\top}(D_2 x - z_2)$$

whereas the dual objective function is:

$$\inf_{x, z_1, z_2} \mathcal{L}(x, z_1, z_2, \nu_1, \nu_2) = -\frac{1}{2}\left\|D_1^{\top}\nu_1 + D_2^{\top}\nu_2\right\|_2^2 + y^{\top}\left(D_1^{\top}\nu_1 + D_2^{\top}\nu_2\right)$$

for -λ_i 1 ≤ ν_i ≤ λ_i 1 (i = 1, 2). Introducing the variables z = (z_1, z_2) and ν = (ν_1, ν_2), the initial problem is equivalent to the dual problem:

$$\min \; \frac{1}{2}\nu^{\top}Q\nu - R^{\top}\nu \qquad \text{u.c.} \quad -\lambda_{+} \leq \nu \leq \lambda_{+}$$

with D = (D_1; D_2), Q = DD^T, R = Dy and λ_+ = (λ_1 1; λ_2 1). The solution of the primal problem is then given by x* = y - D^T ν*.
The L1-T multivariate filter

As in the univariate case, this problem can be solved by considering the dual problem, which is a QP program. The primal problem is:

$$\min \; \frac{1}{2}\sum_{i=1}^{m}\left\|y^{(i)} - x\right\|_2^2 + \lambda\|z\|_1 \qquad \text{u.c.} \quad z = Dx$$

Let us define ȳ = (ȳ_t) with ȳ_t = m^{-1} Σ_{i=1}^{m} y_t^{(i)}. The dual objective function becomes:

$$\inf_{x,z} \mathcal{L}(x, z, \nu) = -\frac{1}{2}\nu^{\top}DD^{\top}\nu + \bar{y}^{\top}D^{\top}\nu + \frac{1}{2}\sum_{i=1}^{m}\left(y^{(i)} - \bar{y}\right)^{\top}\left(y^{(i)} - \bar{y}\right)$$

for -λ1 ≤ ν ≤ λ1. According to the Kuhn-Tucker theorem, the initial problem is equivalent to the dual problem:

$$\min \; \frac{1}{2}\nu^{\top}DD^{\top}\nu - \bar{y}^{\top}D^{\top}\nu \qquad \text{u.c.} \quad -\lambda\mathbf{1} \leq \nu \leq \lambda\mathbf{1}$$

This QP program can be solved by a traditional Newton algorithm or by interior-point methods, and the solution is:

$$x^{\star} = \bar{y} - D^{\top}\nu^{\star}$$
A.1.2 The interior-point algorithm

We present briefly the interior-point algorithm of Boyd and Vandenberghe (2009) in the case of the following optimization problem:

$$\min f_0(x) \qquad \text{u.c.} \quad Ax = b, \quad f_i(x) \leq 0 \;\; \text{for } i = 1, \dots, m$$

where f_0, ..., f_m : R^n → R are convex and twice continuously differentiable and rank(A) = p < n. The inequality constraints become implicit if one rewrites the problem as:

$$\min \; f_0(x) + \sum_{i=1}^{m} I_{-}\left(f_i(x)\right) \qquad \text{u.c.} \quad Ax = b$$

where I_-(u) : R → R is the non-positive indicator function¹. This indicator function is discontinuous, hence the Newton method cannot be applied. In order to overcome this problem, we approximate I_-(u) by the logarithmic barrier function Î_-(u) = -(1/t) ln(-u) with t → ∞. Finally, the Kuhn-Tucker conditions for this approximate problem give r_t(x, λ, ν) = 0 with:

$$r_t(x, \lambda, \nu) = \begin{pmatrix} \nabla f_0(x) + \nabla f(x)^{\top}\lambda + A^{\top}\nu \\ -\operatorname{diag}(\lambda)\, f(x) - \frac{1}{t}\mathbf{1} \\ Ax - b \end{pmatrix}$$

¹ We have:

$$I_{-}(u) = \begin{cases} 0 & u \leq 0 \\ \infty & u > 0 \end{cases}$$

The solution of r_t(x, λ, ν) = 0 can be obtained by Newton's iteration for the triple y = (x, λ, ν):

$$r_t(y + \Delta y) \approx r_t(y) + \nabla r_t(y)\,\Delta y = 0$$

This equation gives the Newton step Δy = -∇r_t(y)^{-1} r_t(y), which defines the search direction.
A.1.3 The scaling of the smoothing parameter of the L1 filter

We can estimate the order of magnitude of the parameter λ_max by considering the continuous case. Assume that the signal is a Wiener process W_t. The value of λ_max in the discrete case, defined by:

$$\lambda_{\max} = \left\|\left(DD^{\top}\right)^{-1}Dy\right\|_{\infty}$$

can be related to the first primitive I_1(T) = ∫_0^T W_t dt of the process W_t if D = D_1 (L1-C filtering), or to the second primitive I_2(T) = ∫_0^T ∫_0^t W_s ds dt of W_t if D = D_2 (L1-T filtering). We have:

$$I_1(T) = \int_0^T W_t \,\mathrm{d}t = W_T\, T - \int_0^T t \,\mathrm{d}W_t = \int_0^T (T - t)\,\mathrm{d}W_t$$

The process I_1(T) is a Wiener integral (i.e. a Gaussian process) with variance:

$$\mathbb{E}\left[I_1^2(T)\right] = \int_0^T (T - t)^2 \,\mathrm{d}t = \frac{T^3}{3}$$

In this case, we expect that λ_max scales as T^{3/2}. The second-order primitive can be calculated in the following way:

$$I_2(T) = \int_0^T I_1(t)\,\mathrm{d}t = I_1(T)\,T - \int_0^T t\,\mathrm{d}I_1(t) = I_1(T)\,T - \int_0^T t\, W_t \,\mathrm{d}t = I_1(T)\,T - \frac{T^2}{2}W_T + \int_0^T \frac{t^2}{2}\,\mathrm{d}W_t = \int_0^T \left(\frac{T^2}{2} - Tt + \frac{t^2}{2}\right)\mathrm{d}W_t = \frac{1}{2}\int_0^T (T - t)^2 \,\mathrm{d}W_t$$

This quantity is again a Gaussian process, with variance:

$$\mathbb{E}\left[I_2^2(T)\right] = \frac{1}{4}\int_0^T (T - t)^4\,\mathrm{d}t = \frac{T^5}{20}$$

In this case, we expect that λ_max scales as T^{5/2}.
A.1.4 Calibration of the L2 filter

We discuss here how to calibrate the L2 filter in order to extract the trend with respect to the investment time horizon T. Although the L2 filter admits an explicit solution, which is a great advantage for numerical implementation, the calibration of the smoothing parameter λ is not trivial. We propose to calibrate the L2 filter by comparing the spectral density of this filter with the one obtained with the moving-average filter. For this last filter, we have:

$$\hat{x}_t^{\mathrm{MA}} = \frac{1}{T}\sum_{i=t-T}^{t-1} y_i$$

It follows that the spectral density is:

$$f^{\mathrm{MA}}(\omega) = \frac{1}{T^2}\left|\sum_{t=0}^{T-1} e^{-i\omega t}\right|^2$$

For the L2 filter, we know that the solution is x̂^{HP} = (1 + 2λ D^T D)^{-1} y. Therefore, the spectral density is:

$$f^{\mathrm{HP}}(\omega) = \left(\frac{1}{1 + 4\lambda\left(3 - 4\cos\omega + \cos 2\omega\right)}\right)^2 \simeq \left(\frac{1}{1 + 2\lambda\,\omega^4}\right)^2$$

The width of the spectral density of the L2 filter is then of order (2λ)^{-1/4}, whereas it is 2πT^{-1} for the moving-average filter. Calibrating the L2 filter can be done by matching these two quantities. Finally, we obtain the following relationship:

$$\lambda^{\star} = \frac{1}{2}\left(\frac{T}{2\pi}\right)^4$$

In Figure A.1, we represent the spectral density of the moving-average filter for different windows T. We also report the spectral density of the corresponding L2 filters; for that, we have calibrated the optimal parameter λ* by least-squares minimization. In Figure A.2, we compare the optimal estimator λ* with the one corresponding to 10.27; we notice that the approximation is very good.
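Under this calibration rule, the horizon T (in observation periods) maps directly to the smoothing parameter; a one-line sketch:

```python
import numpy as np

def lambda_from_horizon(T):
    # lambda* = (1/2) * (T / (2*pi))^4, matching the spectral widths above
    return 0.5 * (T / (2.0 * np.pi)) ** 4
```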
Figure A.1: Spectral density of moving-average and L2 filters

Figure A.2: Relationship between the value of λ and the length of the moving-average filter
A.1.5 Implementation issues

The computational time may be large when working with dense matrices, even if we consider interior-point algorithms. It can be reduced by using sparse matrices, but the most efficient way to optimize the implementation is to consider band matrices. Moreover, we may also note that we have to solve a large linear system at each iteration. Depending on the filtering problem (L1-T, L1-C or L1-TC filters), the system is 6-banded or 3-banded, but always symmetric. For computing λ_max, one may remark that it is equivalent to solving a banded system which is positive definite. We suggest adapting the algorithms in order to take all these properties into account.
Appendix B

Appendix of chapter 2

B.1 Estimator of volatility

B.1.1 Estimation with realized returns

We consider first a single return; the estimator of the variance is then obtained as follows:

$$R_{t_i}^2 = \left(\ln S_{t_i} - \ln S_{t_{i-1}}\right)^2 = \left(\int_{t_{i-1}}^{t_i} \sigma_u \,\mathrm{d}W_u + \int_{t_{i-1}}^{t_i} \left(\mu_u - \frac{1}{2}\sigma_u^2\right)\mathrm{d}u\right)^2$$

The conditional expectation with respect to the pair (μ_u, σ_u), which is assumed to be independent of dW_u, is given by:

$$\mathbb{E}\left[R_{t_i}^2 \mid \mu, \sigma\right] = \int_{t_{i-1}}^{t_i} \sigma_u^2 \,\mathrm{d}u + \left(\int_{t_{i-1}}^{t_i}\left(\mu_u - \frac{1}{2}\sigma_u^2\right)\mathrm{d}u\right)^2$$

which is approximately equal to:

$$(t_i - t_{i-1})\,\sigma_{t_{i-1}}^2 + (t_i - t_{i-1})^2\left(\mu_{t_{i-1}} - \frac{1}{2}\sigma_{t_{i-1}}^2\right)^2$$

The variance of this estimator characterizes its error and reads:

$$\operatorname{var}\left[R_{t_i}^2 \mid \mu, \sigma\right] = \operatorname{var}\left[\left(\int_{t_{i-1}}^{t_i}\sigma_u\,\mathrm{d}W_u + \int_{t_{i-1}}^{t_i}\left(\mu_u - \frac{1}{2}\sigma_u^2\right)\mathrm{d}u\right)^2 \,\Big|\, \mu, \sigma\right]$$

Since, conditionally on μ and σ, the quantity ∫σ_u dW_u + ∫(μ_u - σ_u²/2) du is a Gaussian variable with mean ∫(μ_u - σ_u²/2) du and variance ∫σ_u² du, we obtain the variance of the estimator:

$$\operatorname{var}\left[R_{t_i}^2 \mid \mu, \sigma\right] = 2\left(\int_{t_{i-1}}^{t_i}\sigma_u^2\,\mathrm{d}u\right)^2 + 4\left(\int_{t_{i-1}}^{t_i}\sigma_u^2\,\mathrm{d}u\right)\left(\int_{t_{i-1}}^{t_i}\left(\mu_u - \frac{1}{2}\sigma_u^2\right)\mathrm{d}u\right)^2 \qquad \text{(B.1)}$$

which is approximately equal to:

$$2\,(t_i - t_{i-1})^2\,\sigma_{t_{i-1}}^4 + 4\,(t_i - t_{i-1})^3\,\sigma_{t_{i-1}}^2\left(\mu_{t_{i-1}} - \frac{1}{2}\sigma_{t_{i-1}}^2\right)^2$$

We remark that when the time step (t_i - t_{i-1}) becomes small, the estimator becomes unbiased, with standard deviation √2 (t_i - t_{i-1}) σ²_{t_{i-1}}. This error is directly proportional to the quantity to be estimated.

In order to estimate the average variance between t_0 and t_n, or the approximate volatility at t_n, we can employ the canonical estimator:

$$\sum_{i=1}^{n} R_{t_i}^2 = \sum_{i=1}^{n}\left(\ln S_{t_i} - \ln S_{t_{i-1}}\right)^2$$

The expectation of this estimator reads:

$$\mathbb{E}\left[\sum_{i=1}^{n} R_{t_i}^2 \,\Big|\, \mu, \sigma\right] = \int_{t_0}^{t_n}\sigma_u^2\,\mathrm{d}u + \sum_{i=1}^{n}\left(\int_{t_{i-1}}^{t_i}\left(\mu_u - \frac{1}{2}\sigma_u^2\right)\mathrm{d}u\right)^2$$

We observe that this estimator is weakly biased; however, this effect is totally negligible: for a volatility of 20% with a trend of 10%, the estimated volatility is 20.006% instead of 20%.

The variance of the canonical estimator (the estimation error) reads:

$$\sum_{i=1}^{n}\left[2\left(\int_{t_{i-1}}^{t_i}\sigma_u^2\,\mathrm{d}u\right)^2 + 4\left(\int_{t_{i-1}}^{t_i}\sigma_u^2\,\mathrm{d}u\right)\left(\int_{t_{i-1}}^{t_i}\left(\mu_u - \frac{1}{2}\sigma_u^2\right)\mathrm{d}u\right)^2\right]$$

which can be roughly estimated by:

$$\sum_{i=1}^{n} 2\left(\int_{t_{i-1}}^{t_i}\sigma_u^2\,\mathrm{d}u\right)^2 \simeq 2\,\sigma^4\sum_{i=1}^{n}(t_i - t_{i-1})^2$$

If the recording times t_i are regularly spaced with time step Δt, then we have:

$$\operatorname{var}\left[\sum_{i=1}^{n} R_{t_i}^2 \,\Big|\, \mu, \sigma\right] \simeq 2\,\sigma^4\,\Delta t\,(t_n - t_0)$$
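A small numerical sketch of the canonical estimator and its relative error under the assumptions above (regular time grid, negligible drift bias):

```python
import numpy as np

def realized_vol(prices, dt=1.0 / 252.0):
    r = np.diff(np.log(prices))
    total_time = dt * len(r)
    var_hat = np.sum(r**2) / total_time          # average variance over [t_0, t_n]
    rel_err = np.sqrt(2.0 * dt / total_time)     # relative std of the variance estimate
    return np.sqrt(var_hat), rel_err
```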
Appendix C

Appendix of chapter 3

C.1 Dual problem of SVM

In the traditional approach, the SVM problem is first mapped to the dual problem, which is then solved by a QP program. We present here the detailed derivation of the dual problem in both the hard-margin and soft-margin SVM cases.
C.1.1 Hard-margin SVM classifier

Let us start with the hard-margin SVM problem for classification:

$$\min_{w,b} \; \frac{1}{2}\|w\|^2 \qquad \text{u.c.} \quad y_i\left(w^{\top}x_i + b\right) \geq 1, \quad i = 1, \dots, n$$

In order to get the dual problem, we construct the Lagrangian for the inequality constraints by introducing positive Lagrange multipliers α = (α_1, ..., α_n) ≥ 0:

$$\mathcal{L}(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n}\alpha_i\, y_i\left(w^{\top}x_i + b\right) + \sum_{i=1}^{n}\alpha_i$$

Minimizing the Lagrangian with respect to (w, b), we obtain the following equations:

$$\frac{\partial\mathcal{L}}{\partial w^{\top}} = w - \sum_{i=1}^{n}\alpha_i y_i x_i = 0, \qquad \frac{\partial\mathcal{L}}{\partial b} = \sum_{i=1}^{n}\alpha_i y_i = 0$$

Inserting these results into the Lagrangian, we obtain the dual objective function L_D with respect to the variable α:

$$L_D(\alpha) = \alpha^{\top}\mathbf{1} - \frac{1}{2}\alpha^{\top}D\alpha$$

with D_{ij} = y_i y_j x_i^T x_j and the constraints α^T y = 0 and α ≥ 0. Thanks to the KKT theorem, the initial optimization problem is equivalent to maximizing the dual objective function L_D(α):

$$\max_{\alpha} \; \alpha^{\top}\mathbf{1} - \frac{1}{2}\alpha^{\top}D\alpha \qquad \text{u.c.} \quad \alpha^{\top}y = 0, \;\; \alpha \geq 0$$
C.1.2 Soft-margin SVM classifier

We turn now to the soft-margin SVM classifier with the L1 penalty, i.e. F(u) = u and p = 1. We first write down the primal problem:

$$\min_{w,b,\xi} \; \frac{1}{2}\|w\|^2 + C\,F\!\left(\sum_{i=1}^{n}\xi_i^{p}\right) \qquad \text{u.c.} \quad y_i\left(w^{\top}x_i + b\right) \geq 1 - \xi_i, \;\; \xi_i \geq 0, \quad i = 1, \dots, n$$

We construct the Lagrangian by introducing the pair of Lagrange multipliers (α, β) for the 2n constraints:

$$\mathcal{L}(w, b, \xi, \alpha, \beta) = \frac{1}{2}\|w\|^2 + C\,F\!\left(\sum_{i=1}^{n}\xi_i\right) - \sum_{i=1}^{n}\alpha_i\left[y_i\left(w^{\top}x_i + b\right) - 1 + \xi_i\right] - \sum_{i=1}^{n}\beta_i\xi_i$$

with the constraints α ≥ 0 and β ≥ 0 on the Lagrange multipliers. Minimizing the Lagrangian with respect to (w, b, ξ) gives us:

$$\frac{\partial\mathcal{L}}{\partial w^{\top}} = w - \sum_{i=1}^{n}\alpha_i y_i x_i = 0, \qquad \frac{\partial\mathcal{L}}{\partial b} = \sum_{i=1}^{n}\alpha_i y_i = 0, \qquad \frac{\partial\mathcal{L}}{\partial \xi} = C\mathbf{1} - \alpha - \beta = 0$$

with the inequality constraints α ≥ 0 and β ≥ 0. Inserting these results into the Lagrangian leads to the dual problem:

$$\max_{\alpha} \; \alpha^{\top}\mathbf{1} - \frac{1}{2}\alpha^{\top}D\alpha \qquad \text{u.c.} \quad \alpha^{\top}y = 0, \;\; 0 \leq \alpha \leq C\mathbf{1} \qquad \text{(C.1)}$$
C.1.3 ε-SV regression

We study here the ε-SV regression. We first write down the primal problem with all its constraints:

$$\min_{w,b,\xi,\xi^{*}} \; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\left(\xi_i + \xi_i^{*}\right) \qquad \text{u.c.} \quad \begin{cases} w^{\top}x_i + b - y_i \leq \epsilon + \xi_i \\ y_i - w^{\top}x_i - b \leq \epsilon + \xi_i^{*} \\ \xi_i \geq 0, \;\; \xi_i^{*} \geq 0 \end{cases} \qquad i = 1, \dots, n$$

In this case, we have 4n inequality constraints. Hence, we construct the Lagrangian by introducing the positive Lagrange multipliers (α, α*, β, β*). The Lagrangian of this primal problem reads:

$$\mathcal{L}\left(w, b, \xi, \xi^{*}, \alpha, \alpha^{*}, \beta, \beta^{*}\right) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\left(\xi_i + \xi_i^{*}\right) - \sum_{i=1}^{n}\beta_i\xi_i - \sum_{i=1}^{n}\beta_i^{*}\xi_i^{*} - \sum_{i=1}^{n}\alpha_i\left(\epsilon + \xi_i - w^{\top}\phi(x_i) - b + y_i\right) - \sum_{i=1}^{n}\alpha_i^{*}\left(\epsilon + \xi_i^{*} + w^{\top}\phi(x_i) + b - y_i\right)$$

with α = (α_i)_{i=1...n}, α* = (α_i*)_{i=1...n} and the constraints α, α*, β, β* ≥ 0 on the Lagrange multipliers. Minimizing the Lagrangian with respect to (w, b, ξ, ξ*) gives us:

$$\frac{\partial\mathcal{L}}{\partial w^{\top}} = w - \sum_{i=1}^{n}\left(\alpha_i^{*} - \alpha_i\right)\phi(x_i) = 0, \qquad \frac{\partial\mathcal{L}}{\partial b} = \sum_{i=1}^{n}\left(\alpha_i^{*} - \alpha_i\right) = 0, \qquad \frac{\partial\mathcal{L}}{\partial\xi} = C\mathbf{1} - \alpha - \beta = 0, \qquad \frac{\partial\mathcal{L}}{\partial\xi^{*}} = C\mathbf{1} - \alpha^{*} - \beta^{*} = 0$$

Inserting these results into the Lagrangian leads to the dual problem:

$$\max_{\alpha,\alpha^{*}} \; \left(\alpha^{*} - \alpha\right)^{\top}y - \epsilon\left(\alpha + \alpha^{*}\right)^{\top}\mathbf{1} - \frac{1}{2}\left(\alpha^{*} - \alpha\right)^{\top}K\left(\alpha^{*} - \alpha\right) \qquad \text{u.c.} \quad \left(\alpha^{*} - \alpha\right)^{\top}\mathbf{1} = 0, \;\; 0 \leq \alpha, \alpha^{*} \leq C\mathbf{1} \qquad \text{(C.2)}$$

When ε = 0, the term ε(α + α*)^T 1 in the objective function disappears; we can then reduce the optimization problem by the change of variable γ = α* - α, with the inequality constraint |γ| ≤ C1.

The dual problem can be solved by the QP program, which gives the optimal solution γ*. In order to compute b, we use the KKT conditions:

$$\alpha_i\left(w^{\top}\phi(x_i) + b - y_i - \epsilon - \xi_i\right) = 0, \qquad \alpha_i^{*}\left(y_i - w^{\top}\phi(x_i) - b - \epsilon - \xi_i^{*}\right) = 0, \qquad (C - \alpha_i)\,\xi_i = 0, \qquad (C - \alpha_i^{*})\,\xi_i^{*} = 0$$

We remark that the two last conditions give us ξ_i = 0 for 0 < α_i < C and ξ_i* = 0 for 0 < α_i* < C. In the case ε = 0, this implies directly the following condition for all support vectors (x_i, y_i) of the training set:

$$w^{\top}\phi(x_i) + b - y_i = 0$$

We denote by SV the set of support vectors. Using the condition w = Σ_{i=1}^{n} (α_i* - α_i) φ(x_i) and averaging over the support vectors, we finally obtain:

$$b = \frac{1}{n_{\mathrm{SV}}}\sum_{i \in \mathrm{SV}}\left(y_i - z_i\right)$$

with z = K(α* - α).
C.2 Newton optimization for the primal problem

We consider here the Newton optimization scheme for solving the unconstrained primal problem:

$$\min_{\beta, b} L_P(\beta, b) = \min_{\beta, b} \; \frac{1}{2}\beta^{\top}K\beta + C\sum_{i=1}^{n} L\left(y_i, K_i^{\top}\beta + b\right)$$

The required condition for this scheme is that the function L(y, t) be differentiable. We first study the case of the quadratic loss, where L(y, t) is differentiable, and then the soft-margin case, where we have to regularize L(y, t).
C.2.1 Quadratic loss function

For the quadratic loss case, the penalty function has a suitable form:

$$L\left(y_i, f(x_i)\right) = \max\left(0, 1 - y_i f(x_i)\right)^2$$

This function is differentiable everywhere and its derivative reads:

$$\frac{\partial L}{\partial t}(y, t) = 2y\,(yt - 1)\,\mathbb{1}_{\{yt \leq 1\}}$$

However, the second derivative is not defined at the point yt = 1. In order to avoid this problem, we consider directly the function L_P as a function of the vector (b, β) and perform a quasi-Newton optimization, in which the second derivative is replaced by an approximation of the Hessian matrix. The gradient of the objective function with respect to the vector (b, β)^T is given by:

$$\nabla L_P = \begin{pmatrix} 2C\,\mathbf{1}^{\top}I^{0}\mathbf{1} & 2C\,\mathbf{1}^{\top}I^{0}K \\ 2C\,K I^{0}\mathbf{1} & K + 2C\,K I^{0}K \end{pmatrix}\begin{pmatrix} b \\ \beta\end{pmatrix} - 2C\begin{pmatrix}\mathbf{1}^{\top}I^{0}y \\ K I^{0}y\end{pmatrix}$$

(where I⁰ denotes the diagonal indicator matrix of the margin-violating points, i.e. those with y_i f(x_i) < 1), and the pseudo-Hessian matrix is given by:

$$H = \begin{pmatrix} 2C\,\mathbf{1}^{\top}I^{0}\mathbf{1} & 2C\,\mathbf{1}^{\top}I^{0}K \\ 2C\,K I^{0}\mathbf{1} & K + 2C\,K I^{0}K \end{pmatrix}$$

The Newton iteration then consists of updating the vector (b, β)^T until convergence as follows:

$$\begin{pmatrix} b \\ \beta \end{pmatrix} \leftarrow \begin{pmatrix} b \\ \beta \end{pmatrix} - H^{-1}\nabla L_P$$
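A compact sketch of this quasi-Newton scheme (the small ridge term added for numerical stability is not part of the derivation above):

```python
import numpy as np

def primal_svm_quadratic(K, y, C, n_iter=20, ridge=1e-8):
    # K: (n, n) kernel matrix, y: labels in {-1, +1}
    n = len(y)
    beta, b = np.zeros(n), 0.0
    ones = np.ones((n, 1))
    for _ in range(n_iter):
        f = K @ beta + b
        I0 = np.diag((y * f < 1).astype(float))      # margin-violating points
        top = np.hstack([2 * C * ones.T @ I0 @ ones, 2 * C * ones.T @ I0 @ K])
        bot = np.hstack([2 * C * K @ I0 @ ones, K + 2 * C * K @ I0 @ K])
        H = np.vstack([top, bot])                    # pseudo-Hessian
        v = np.concatenate(([b], beta))
        g = H @ v - 2 * C * np.concatenate([(ones.T @ I0 @ y).ravel(), K @ I0 @ y])
        v = v - np.linalg.solve(H + ridge * np.eye(n + 1), g)
        b, beta = v[0], v[1:]
    return beta, b
```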
C.2.2 Soft-margin SVM

For the soft-margin case, the penalty function has the following form:

$$L\left(y_i, f(x_i)\right) = \max\left(0, 1 - y_i f(x_i)\right)$$

which requires a regularization. A differentiable approximation is obtained with the following penalty function:

$$L(y, t) = \begin{cases} 0 & \text{if } yt > 1 + h \\ \dfrac{(1 + h - yt)^2}{4h} & \text{if } |1 - yt| \leq h \\ 1 - yt & \text{if } yt < 1 - h \end{cases}$$
Published paper in the Lyxor White Paper Series:

Trend Filtering Methods for Momentum Strategies, Lyxor White Paper Series, Issue #8, December 2011.
http://www.lyxor.com/fr/publications/white-papers/wp/52/
WHITE PAPER — TREND FILTERING METHODS FOR MOMENTUM STRATEGIES — December 2011 — Issue #8
Benjamin Bruder
Research & Development
Lyxor Asset Management, Paris
benjamin.bruder@lyxor.com
Jean-Charles Richard
Research & Development
Lyxor Asset Management, Paris
jean-charles.richard@lyxor.com
Tung-Lam Dao
Research & Development
Lyxor Asset Management, Paris
tung-lam.dao@lyxor.com
Thierry Roncalli
Research & Development
Lyxor Asset Management, Paris
thierry.roncalli@lyxor.com
Foreword

The widespread endeavor to identify trends in market prices has given rise to a significant amount of literature. Elliott Wave Principles, Dow Theory and business cycles, among many others, are common examples of attempts to better understand the nature of market price trends.

Unfortunately this literature often proves frustrating. In their attempt to discover new rules, many authors eventually lack precision and forget to apply basic research methodology. Results are indeed often presented without any reference either to the necessary hypotheses or to confidence intervals. As a result, it is difficult for investors to find firm guidance there and to differentiate phonies from the real McCoy.

This said, attempts to differentiate meaningful information from exogenous noise lie at the core of modern Statistics and Time Series Analysis. Time Series Analysis follows similar goals as the above-mentioned approaches, but in a manner which can be tested. Today more than ever, modern computing capacities allow anybody to implement quite powerful tools and to independently tackle trend estimation issues. The primary aim of this 8th White Paper is to act as a comprehensive and simple handbook to the most widespread trend measurement techniques.

Even equipped with refined measurement tools, investors still have to remain wary about their representation of trends. Trends are sometimes thought of as some hidden force pushing markets up or down. In this deterministic view, trends should persist.

However, random walks also generate trends! Five reds drawn in a row from an unbiased roulette wheel do not give any clue about the next drawn color. It is just a past trend with nothing to do with any underlying structure, but a mere succession of independent events. And the bottom line is that none of those two hypotheses can be confirmed or dismissed with certainty.

As a consequence, overfitting issues constitute one of the most serious pitfalls in applying trend filtering techniques in finance. Designing effective calibration procedures proves to be as important as the theoretical knowledge of trend measurement theories. The practical use of trend extraction techniques for investment purposes constitutes the other topic addressed in this 8th White Paper.

Nicolas Gaussel
Global Head of Quantitative Asset Management
Executive Summary

Introduction

The efficient market hypothesis implies that all available information is reflected in current prices, and thus that future returns are unpredictable. Nevertheless, this assumption has been rejected in a large number of academic studies. It is commonly accepted that financial assets may exhibit trends or cycles. Some studies cite slow-moving economic variables related to the business cycle as an explanation for these trends. Other research argues that investors are not fully rational, meaning that prices may underreact in the short run and overreact at long horizons.

Momentum strategies try to benefit from these trends. There are two opposing types: trend following and contrarian. Trend-following strategies are momentum strategies in which an asset is purchased if its price is rising, while in a contrarian strategy assets are sold if the price is falling. The first step in both strategies is trend estimation, which is the focus of this paper. After a review of trend filtering techniques, we address practical issues, depending on whether trend detection is designed to explain the past or forecast the future.

The principles of trend filtering

In time series analysis, the trend is considered to be the component containing the global change, which contrasts with local changes due to noise. The separation between trend and noise has a long mathematical history, and continues to be of great interest to the scientific community. There is no precise definition of the trend, but it is generally accepted that it is a smooth function representing long-term movement. Thus, trends should exhibit slow change, while noise is assumed to be highly volatile.

The simplest trend filtering method is the moving average filter. On average, the noisy parts of observations tend to cancel each other out, while the trend has a cumulative nature. But observations can be averaged using many different types of weightings. More generally, the different averages obtained are referred to as linear filtering. Several examples representing trend filtering for various linear filters are shown in Figure 1. In this example, the averaging horizon (65 business days or one year) has much more influence than the type of averaging.

Other trend-following methods, which are classified as nonlinear, use more complex calculations to obtain more specific results (such as filters based on wavelet analysis, support vector machines or singular spectrum analysis). For instance, the L1 filter is designed to obtain piecewise-constant trends, which can be interpreted more easily.
Figure 1: Trend estimate of the S&P 500 index
Variations around a benchmark estimator
Trend filtering can be performed either to explain past behaviour of asset prices, or to forecast future returns. The choice of the estimator and its calibration primarily depend on that objective. If the goal is to explain past price behaviour, there are two possible approaches. The first is to select the model and parameters that minimise past prediction error. This can be performed using a cross-validation procedure, for example. The second option is to consider a benchmark estimator, such as the six-month moving average, and to calibrate another model to be as close to the benchmark as possible. For instance, the L1 filter of Figure 2 is calibrated to deliver a constant trend over an average six-month period. This type of filter is more easily interpreted than the original six-month moving average, with clearly delimited trend periods. This procedure can be performed on any time series.
From trend filtering to forecasting

Trend filtering may also be a predictive tool. This is a much more ambitious objective. It supposes that the last observed trend has an influence on future asset returns. More precisely, trend following predictions suppose that positive (or negative) trends are more likely to be followed by positive (or negative) returns. Any trend following method would be useless if this assumption did not hold.

Figure 3 illustrates that the distributions of the one-month GSCI index returns after a very positive three-month trend (i.e. above a threshold) clearly dominate the return distribution after a very negative trend (i.e. below the threshold).
Figure 2: L1 versus moving average filtering
Figure 3: Distribution of the conditional standardised monthly return
Furthermore, this persistence effect is also tested in Table 1 for a number of major financial indices. This table compares the average one-month return following a positive three-month trend period to the average one-month return following a negative three-month trend period.

Table 1: Average one-month conditional return based on past trends

Trend Positive Negative Difference
Eurostoxx 50 1.1% 0.2% 0.9%
S&P 500 0.9% 0.5% 0.4%
MSCI WORLD 0.6% 0.3% 1.0%
MSCI EM 1.9% 0.3% 2.2%
TOPIX 0.4% 0.4% 0.9%
EUR/USD 0.2% 0.2% 0.4%
USD/JPY 0.2% 0.2% 0.4%
GSCI 1.3% 0.4% 1.6%
On average, for all indices under consideration, returns are higher after a positive trend than after a negative one. Thus, the trends are persistent, and seem to have a predictive value. This makes the case for the study of trend following strategies, and highlights the appeal of trend filtering methods.

Conclusion

The ultimate goal of trend filtering in finance is to design portfolio strategies that may benefit from the identified trends. Such strategies must rely on appropriate trend estimators and time horizons. This paper highlights the variety of estimators available in the academic literature. But the choice of trend estimator is just one of the many questions that arises in the definition of those strategies. In particular, diversification and risk budgeting are key aspects of success.
Table of Contents
1 Introduction 9
2 A review of econometric estimators for trend filtering 10
2.1 The trend-cycle model . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Linear filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Nonlinear filtering . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4 Multivariate filtering . . . . . . . . . . . . . . . . . . . . . . . . 27
3 Trend filtering in practice 30
3.1 The calibration problem . . . . . . . . . . . . . . . . . . . . . . 30
3.2 What about the variance of the estimator? . . . . . . . . . . . . 33
3.3 From trend filtering to trend forecasting . . . . . . . . . . . . . 38
4 Conclusion 40
A Statistical complements 41
A.1 State space model and Kalman filtering . . . . . . . . . . . . . . 41
A.2 L1 filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
A.3 Wavelet analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 44
A.4 Support vector machine . . . . . . . . . . . . . . . . . . . . . . 47
A.5 Singular spectrum analysis . . . . . . . . . . . . . . . . . . . . . 50
Trend Filtering Methods
for Momentum Strategies
∗
Benjamin Bruder
Research & Development
Lyxor Asset Management, Paris
benjamin.bruder@lyxor.com
Tung-Lam Dao
Research & Development
Lyxor Asset Management, Paris
tung-lam.dao@lyxor.com
Jean-Charles Richard
Research & Development
Lyxor Asset Management, Paris
jean-charles.richard@lyxor.com
Thierry Roncalli
Research & Development
Lyxor Asset Management, Paris
thierry.roncalli@lyxor.com
December 2011
Abstract
This paper studies trend filtering methods. These methods are widely used in momentum strategies, which correspond to an investment style based only on the history of past prices. For example, the CTA strategy used by hedge funds is one of the best-known momentum strategies. In this paper, we review the different econometric estimators to extract a trend of a time series. We distinguish between linear and nonlinear models as well as univariate and multivariate filtering. For each approach, we provide a comprehensive presentation, an overview of its advantages and disadvantages and an application to the S&P 500 index. We also consider the calibration problem of these filters. We illustrate the two main solutions, the first based on prediction error, and the second using a benchmark estimator. We conclude the paper by listing some issues to consider when implementing a momentum strategy.

Keywords: Momentum strategy, trend following, moving average, filtering, trend extraction.

JEL classification: G11, G17, C63.
1 Introduction
The efficient market hypothesis tells us that financial asset prices fully reflect all available information (Fama, 1970). One consequence of this theory is that future returns are not predictable. Nevertheless, since the beginning of the nineties, a large body of academic research has rejected this assumption. One of the arguments is that risk premiums are time varying and depend on the business cycle (Cochrane, 2001). In this framework, returns on financial assets are related to some slow-moving economic variables that exhibit cyclical patterns in accordance with the business cycle. Another argument is that some agents are not fully rational, meaning that prices may underreact in the short run but overreact at long horizons (Hong and Stein, 1997). This phenomenon may be easily explained by the theory of behavioural finance (Barberis and Thaler, 2002).

Based on these two arguments, it is now commonly accepted that prices may exhibit trends or cycles. In some sense, these arguments chime with the Dow theory (Brown et al., 1998), which is one of the first momentum strategies. A momentum strategy is an investment style based only on the history of past prices (Chan et al., 1996). We generally distinguish between two types of momentum strategy:

1. the trend following strategy, which consists of buying (or selling) an asset if the estimated price trend is positive (or negative);

2. the contrarian (or mean-reverting) strategy, which consists of selling (or buying) an asset if the estimated price trend is positive (or negative).

Contrarian strategies are clearly the opposite of trend following strategies. One of the tasks involved in these strategies is to estimate the trend, except when they are based on mean-reverting processes (see d'Aspremont, 2011). In this paper, we provide a survey of the different trend filtering methods. However, trend filtering is just one of the difficulties in building a momentum strategy. The complete process of constructing a momentum strategy is highly complex, especially as regards transforming past trends into exposures, an important factor that is beyond the scope of this paper.

The paper is organized as follows. Section two presents a survey of the different econometric trend estimators. In particular, we distinguish between methods based on linear filtering and nonlinear filtering. In section three, we consider some issues that arise when trend filtering is applied in practice. We also propose some methods for calibrating trend filtering models and highlight the problem of estimator variance. Section four offers some concluding remarks.

∗ We are grateful to Guillaume Jamet and Hoang-Phong Nguyen for their helpful comments.
2 A review of econometric estimators for trend filtering

Trend filtering (or trend detection) is a major task of time series analysis from both a mathematical and financial viewpoint. The trend of a time series is considered to be the component containing the global change, which contrasts with local changes due to noise. The trend filtering procedure concerns not only the problem of denoising; it must also take into account the dynamics of the underlying process. This explains why mathematical approaches to trend extraction have a long history, and why this subject is still of great interest to the scientific community¹. From an investment perspective, trend filtering is fundamental to most momentum strategies developed in the asset management and hedge fund sectors in order to improve performance and limit portfolio risks.

¹ See Alexandrov et al. (2008).

2.1 The trend-cycle model

In economics, trend-cycle decomposition plays an important role by identifying the permanent and transitory stochastic components in a non-stationary time series. Generally, the permanent component can be interpreted as a trend, whereas the transitory component may
be a noise or a stochastic cycle. Let y_t be a stochastic process. We assume that y_t is the sum of two different unobservable parts:

y_t = x_t + ε_t

where x_t represents the trend and ε_t is a stochastic (or noise) process. There is no precise definition for trend, but it is generally accepted to be a smooth function representing long-term movements:

"[...] the essential idea of trend is that it shall be smooth" (Kendall, 1973).

It means that changes in the trend x_t must be smaller than those of the process y_t. From a statistical standpoint, it implies that the volatility of y_t − y_{t−1} is higher than the volatility of x_t − x_{t−1}:

σ(y_t − y_{t−1}) ≫ σ(x_t − x_{t−1})

One of the major problems in financial econometrics is the estimation of x_t. This is the subject of signal extraction and filtering (Pollock, 2009).
Finite moving average filtering for trend estimation has a long history. It has been used in actuarial science since the beginning of the twentieth century². But the modern theory of signal filtering has its origins in the Second World War and was formulated independently by Norbert Wiener (1941) and Andrei Kolmogorov (1941) in two different ways. Wiener worked principally in the frequency domain whereas Kolmogorov considered a time-domain approach. This theory was extensively developed in the fifties and sixties by mathematicians and statisticians such as Hermann Wold, Peter Whittle, Rudolf Kalman, Maurice Priestley, George Box, etc. In economics, the problem of trend filtering is not a recent one, and may date back to the seminal article of Muth (1960). It was extensively studied in the eighties and nineties in the literature on business cycles, which led to a vast body of empirical research being carried out in this area³. However, it is in climatology that trend filtering is most extensively studied nowadays. Another important point is that the development of filtering techniques has evolved according to the development of computational power and the IT industry. The Savitzky-Golay smoothing procedure may appear very basic today though it was revolutionary⁴ when it was published in 1964.

In what follows, we review the class of filtering techniques that is generally used to estimate a trend. Moving average filters play an important role in finance. As they are very intuitive and easy to implement, they undoubtedly represent the model most commonly used in trading strategies. The moving average technique belongs to the class of linear filters, which share a lot of common properties. After studying this class of filters, we consider some nonlinear filtering techniques, which may be well suited to solving financial problems.

² See, in particular, the works of Henderson (1916), Whittaker (1923) and Macaulay (1931).
³ See for example Cleveland and Tiao (1976), Beveridge and Nelson (1981), Harvey (1991) or Hodrick and Prescott (1997).
⁴ The paper of Savitzky and Golay (1964) is still considered by the Analytical Chemistry journal to be one of its 10 seminal papers.

2.2 Linear filtering

2.2.1 The convolution representation

We denote by y = {. . . , y_{−2}, y_{−1}, y_0, y_1, y_2, . . .} the ordered sequence of observations of the process y_t. Let x̂_t be the estimator of the underlying trend x_t, which is by definition an
unobservable process. A filtering procedure consists of applying a filter ℒ to the data y:

x̂ = ℒ(y)

with x̂ = {. . . , x̂_{−2}, x̂_{−1}, x̂_0, x̂_1, x̂_2, . . .}. When the filter is linear, we have x̂ = ℒ y with the normalisation condition 1 = ℒ 1. If we assume that the signal y_t is observed at regular dates⁵, we obtain:

x̂_t = Σ_{i=−∞}^{+∞} ℒ_{t,t−i} y_{t−i}    (1)

We deduce that linear filtering may be viewed as a convolution. The previous filter may not be of much use, however, because it uses future values of y_t. As a result, we generally impose some restriction on the coefficients ℒ_{t,t−i} in order to use only past and present values of the signal. In this case, we say that the filter is causal. Moreover, if we restrict our study to time-invariant filters, the equation (1) becomes a simple convolution of the observed signal y_t with a window function ℒ_i:

x̂_t = Σ_{i=0}^{n−1} ℒ_i y_{t−i}    (2)

With this notation, a linear filter is characterised by a window kernel ℒ_i and its support. The kernel defines the type of filtering, whereas the support defines the range of the filter. For instance, if we take a square window on a compact support [0, T], with T = nΔ the width of the averaging window, we obtain the well-known moving average filter:

ℒ_i = (1/n) · 1{i < n}

We finish this description by considering the lag representation:

x̂_t = Σ_{i=0}^{n−1} ℒ_i L^i y_t

with the lag operator L satisfying L y_t = y_{t−1}.
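As an illustration of the causal convolution (2), the following Python sketch applies a window kernel to a series of observations; the function name and the uniform kernel are purely illustrative, and daily log-prices stored in a NumPy array are assumed.

```python
import numpy as np

def causal_filter(y, kernel):
    """Causal linear filter: x_t = sum_i L_i * y_{t-i}.

    y      : 1-d array of observations (e.g. log-prices), oldest first
    kernel : window L_0, ..., L_{n-1} (should sum to one for a trend filter)
    """
    y = np.asarray(y, dtype=float)
    n = len(kernel)
    x = np.full(len(y), np.nan)
    for t in range(n - 1, len(y)):
        # only past and present observations are used (causal filter)
        x[t] = np.dot(kernel, y[t - np.arange(n)])
    return x

# uniform moving average of the log-price over n = 65 business days
n = 65
uniform_kernel = np.ones(n) / n
# trend = causal_filter(np.log(prices), uniform_kernel)
```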
2.2.2 Measuring the trend and its derivative

We discuss here how to use linear filtering to measure the trend of an asset price and its derivative. Let S_t be the asset price which follows the dynamics of the Black-Scholes model:

dS_t / S_t = μ_t dt + σ_t dW_t

where μ_t is the drift, σ_t is the volatility and W_t is a standard Brownian motion. The asset price S_t is observed at a series of discrete dates t_0, . . . , t_n. Within this model, the appropriate signal to be filtered is the logarithm of the price y_t = ln S_t, and not the price itself. Let R_t = ln S_t − ln S_{t−1} represent the realised return at time t over a unit period. If μ_t and σ_t are known, we have:

R_t = (μ_t − σ_t²/2) Δ + σ_t √Δ ε_t

⁵ We have t_{i+1} − t_i = Δ.
where ε_t is a standard Gaussian white noise. The filtered trend can be extracted using the following equation:

x̂_t = Σ_{i=0}^{n−1} ℒ_i y_{t−i}

and the estimator of μ_t is⁶:

μ̂_t ≃ (1/Δ) Σ_{i=0}^{n−1} ℒ_i R_{t−i}

⁶ If we neglect the contribution from the term σ_t²/2. Moreover, we consider Δ = 1 to simplify the calculation.

We can also obtain the same result by applying the filter directly to the signal and defining the derivative kernel 𝒟_i of the window function ℒ_i:

μ̂_t ≃ (1/Δ) Σ_{i=0}^{n} 𝒟_i y_{t−i}

We obtain the following correspondence:

𝒟_i = ℒ_0 if i = 0 ; ℒ_i − ℒ_{i−1} if i = 1, . . . , n − 1 ; −ℒ_{n−1} if i = n    (3)

Remark 1 In some sense, μ̂_t and x̂_t are related by the following expression:

μ̂_t = (d/dt) x̂_t

Econometric methods principally involve x̂_t, whereas μ̂_t is more important for trading strategies.

Remark 2 μ̂_t is a biased estimator of μ_t and the bias increases with the volatility of the process σ_t. The expression of the unbiased estimator is then:

μ̂_t = (1/2) σ̂_t² + (1/Δ) Σ_{i=0}^{n−1} ℒ_i R_{t−i}

Remark 3 In the previous analysis, x̂_t and μ̂_t are two estimators. We may also represent them by their corresponding probability density functions. It is therefore easy to derive estimates, but we should not forget that these estimators present some variance. In finance, and in particular in trading strategies, the question of statistical inference is generally not addressed. However, it is a crucial factor in designing a successful momentum strategy.

2.2.3 Moving average filters

Average return over a given period. Here, we consider the simplest case corresponding to the moving average filter where the form of the window is:

ℒ_i = (1/n) · 1{i < n}

In this case, the only calibration parameter is the window support, i.e. T = nΔ. It characterises the smoothness of the filtered signal. For the limit T → 0, the window becomes a Dirac distribution δ_t and the filtered signal is exactly the same as the observed signal:
x̂_t = y_t. For T > 0, if we assume that the noise ε_t is independent from x_t and is a centered process, the first contribution of the filtered signal is the average trend:

x̄_t = (1/n) Σ_{i=0}^{n−1} x_{t−i}

If the trend is homogeneous, this average value is located at t − (n − 1)Δ/2 by construction. It means that the filtered signal lags the observed signal by a time period which is half the window. To extract the derivative of the trend, we compute the derivative kernel 𝒟_i, which is given by the following formula:

𝒟_i = (1/n)(δ_{i,0} − δ_{i,n})

where δ_{i,j} is the Kronecker delta⁷. The main advantage of using a moving average filter is the reduction of noise due to the central limit theorem. For the limit case n → ∞, the signal is completely denoised but it corresponds to the average value of the trend. The estimator is also biased. In trend filtering, we also face a trade-off between denoising maximisation and bias minimisation. The problem is the calibration procedure for the lag window T. Another way to determine the optimal parameter T* is to take into account the dynamics of the trend.

The above moving average filter can be applied directly to the signal. However, μ̂_t is simply the cumulative return over the window period. It needs only the first and last dates of the period under consideration.

Moving average crossovers. Many practitioners, and even individual investors, use the moving average of the price itself as a trend indication, instead of the moving average of returns. These moving averages are generally uniform moving averages of the price. Here we will consider an average of the logarithm of the price, in order to be consistent with the previous examples:

ȳ_t^{(n)} = (1/n) Σ_{i=0}^{n−1} y_{t−i}

Of course, an average price does not estimate the trend μ_t. This trend is estimated from the difference between two moving averages over two different time horizons n_1 and n_2. Supposing that n_1 > n_2, the trend may be estimated from:

μ̂_t ≃ [2 / ((n_1 − n_2) Δ)] (ȳ_t^{(n_2)} − ȳ_t^{(n_1)})    (4)

In particular, the estimated trend is positive if the short-term moving average is higher than the long-term moving average. Thus, the sign of the trend changes when the short-term moving average crosses the long-term moving average. Of course, when the short-term horizon n_2 is one, then the short-term moving average is just the current asset price. The scaling term 2 (n_1 − n_2)^{−1} Δ^{−1} is explained below. It is derived from the interpretation of this estimator as a weighted moving average of asset returns. Indeed, this estimator can be interpreted in terms of asset returns by inverting the formula (3), with ℒ_i being interpreted as the primitive of 𝒟_i:

ℒ_i = 𝒟_0 if i = 0 ; 𝒟_i + ℒ_{i−1} if i = 1, . . . , n − 1

⁷ δ_{i,j} is equal to 1 if i = j and 0 otherwise.
The weighting of each return in the estimator (4) is represented in Figure 1. It forms a triangle, and the biggest weighting is given at the horizon of the smallest moving average. Therefore, depending on the horizon n_2 of the shortest moving average, the indicator can be focused toward the current trend (if n_2 is small) or toward past trends (if n_2 is as large as n_1/2, for instance). From these weightings, in the case of a constant trend μ, we can compute the expectation of the difference between the two moving averages:

E[ȳ_t^{(n_2)} − ȳ_t^{(n_1)}] = [(n_1 − n_2)/2] (μ − σ_t²/2) Δ

Therefore, the scaling factor in formula (4) appears naturally.
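The crossover estimator (4) is easy to reproduce; the following Python sketch computes it from a log-price series, assuming Δ = 1 (daily data) and uniform moving averages. Function and argument names are illustrative.

```python
import numpy as np

def crossover_trend(y, n1, n2):
    """Trend estimate (4): mu_t ~ 2/(n1 - n2) * (ybar_{n2} - ybar_{n1}),
    with n1 > n2 and Delta = 1. y is the log-price series, oldest first."""
    assert n1 > n2, "n1 must be the long-term horizon"
    y = np.asarray(y, dtype=float)
    mu = np.full(len(y), np.nan)
    for t in range(n1 - 1, len(y)):
        ybar_long = y[t - n1 + 1:t + 1].mean()    # long-term moving average
        ybar_short = y[t - n2 + 1:t + 1].mean()   # short-term moving average
        mu[t] = 2.0 / (n1 - n2) * (ybar_short - ybar_long)
    return mu
```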
Figure 1: Window function ℒ_i of moving average crossovers (n_1 = 100)
Enhanced filters. To improve the uniform moving average estimator, we may take the following kernel function:

𝒟_i = (4/n²) sgn(n/2 − i)

We notice that the estimator μ̂_t now takes into account all the dates of the window period. By taking the primitive of the function 𝒟_i, the trend filter is given as follows:

ℒ_i = (4/n²) (n/2 − |i − n/2|)

We now move to the second type of moving average filter, which is characterised by an asymmetric form of the convolution kernel. One possibility is to take an asymmetric window function with a triangular form:

ℒ_i = (2/n²) (n − i) · 1{i < n}
By computing the derivative of this window function, we obtain the following kernel:

𝒟_i = (2/n) (δ_{i,0} − (1/n) · 1{i < n})

The filtering equation of μ̂_t then becomes:

μ̂_t = (2/n) [ y_t − (1/n) Σ_{i=0}^{n−1} y_{t−i} ]

Remark 4 Another way to define μ̂_t is to consider the Lanczos generalised derivative (Groetsch, 1998). Let f(x) be a function. We define the Lanczos derivative of f(x) in terms of the following relationship:

(d_L/dx) f(x) = lim_{ε→0} (3/(2ε³)) ∫_{−ε}^{ε} t f(x + t) dt

In the discrete case, we have:

(d_L/dx) f(x) = lim_{h→0} [ Σ_{k=−n}^{n} k f(x + kh) ] / [ 2 h Σ_{k=1}^{n} k² ]

We first notice that the Lanczos derivative is more general than the traditional derivative. Although Lanczos' formula is a more onerous method for finding the derivative, it offers some advantages. This technique allows us to compute a pseudo-derivative at points where the function is not differentiable. For the observable signal y_t, the traditional derivative does not exist because of the noise ε_t, but the Lanczos derivative does. Let us apply the Lanczos formula to estimate the derivative of the trend at the point t − T/2. We obtain:

(d_L/dt) x_t = (12/n³) Σ_{i=0}^{n} (n/2 − i) y_{t−i}

We deduce that the kernel is:

𝒟_i = (12/n³) (n/2 − i) · 1{0 ≤ i ≤ n}

By computing an integration by parts, we obtain the trend filter:

ℒ_i = (6/n³) i (n − i) · 1{0 ≤ i ≤ n}
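To make these kernels concrete, the following Python sketch builds the Lanczos trend and derivative kernels on a window of n + 1 points; it is an illustrative helper, not part of the original study.

```python
import numpy as np

def lanczos_kernels(n):
    """Lanczos kernels on the window i = 0, ..., n:
    trend kernel  L_i = 6/n^3 * i * (n - i)
    slope kernel  D_i = 12/n^3 * (n/2 - i)
    """
    i = np.arange(n + 1)
    trend_kernel = 6.0 / n**3 * i * (n - i)
    slope_kernel = 12.0 / n**3 * (n / 2.0 - i)
    return trend_kernel, slope_kernel

# slope estimate with Delta = 1: mu_t ~ sum_i D_i * y_{t-i}
```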
In Figure 2, we have represented the different functions ℒ_i given in this paragraph. We may extend these filters by computing the convolution of two or more filters. For example, the mixed filter in Figure 2 is the convolution of the asymmetric filter with the Lanczos filter. Let us apply these filters to the S&P 500 index. The results are given in Figure 3 for two values of the window length (n = 65 days and n = 260 days). We notice that the choice of n has a big impact on the filtered series. The choice of the window function seems to be less important at first sight. However, we should mention that traders are principally interested in the derivative of the trend, and not the absolute value of the trend itself. In this case, the window function may have a significant impact. Figure 4 is the scatterplot of the μ̂_t statistic in the case of the S&P 500 index from January 2000 to July 2011 (we have considered the uniform and Lanczos filters using n = 260). We may also show that this impact increases when we reduce the length of the window, as illustrated in Table 1.
Figure 2: Window function ℒ_i of moving average filters (n = 100)
Figure 3: Trend estimate for the S&P 500 index
Table 1: Correlation between the uniform and Lanczos derivatives
n 5 10 22 65 130 260
Pearson 84.67 87.86 90.14 90.52 92.57 94.03
Kendall 65.69 68.92 70.94 71.63 73.63 76.17
Spearman 83.15 86.09 88.17 88.92 90.18 92.19
Figure 4: Comparison of the derivative of the trend
2.2.4 Least squares filters

L2 filtering. The previous Lanczos filter may be viewed as a local linear regression (Burch et al., 2005). More generally, least squares methods are often used to define trend estimators:

(x̂_1, . . . , x̂_n) = arg min (1/2) Σ_{t=1}^{n} (y_t − x̂_t)²

However, this problem is not well-defined. We also need to impose some restrictions on the underlying process y_t or on the filtered trend x̂_t to obtain a solution. For example, we may consider a deterministic constant trend:

x_t = x_{t−1} + μ

In this case, we have:

y_t = μ t + ε_t    (5)

Estimating the filtered trend x̂_t is then equivalent to estimating the coefficient μ:

μ̂ = ( Σ_{t=1}^{n} t y_t ) / ( Σ_{t=1}^{n} t² )
If we consider a trend that is not constant, we may define the following objective function:

(1/2) Σ_{t=1}^{n} (y_t − x̂_t)² + λ Σ_{t=2}^{n−1} (x̂_{t−1} − 2 x̂_t + x̂_{t+1})²

In this function, λ is the regularisation parameter which controls the competition between the smoothness⁸ of x̂_t and the noise y_t − x̂_t. We may rewrite the objective function in the vectorial form:

(1/2) ‖y − x̂‖₂² + λ ‖D x̂‖₂²

where y = (y_1, . . . , y_n), x̂ = (x̂_1, . . . , x̂_n) and D is the (n − 2) × n second-difference operator, whose t-th row is (0, . . . , 0, 1, −2, 1, 0, . . . , 0). The estimator is then given by the following solution:

x̂ = (I + 2λ DᵀD)^{−1} y

It is known as the Hodrick-Prescott filter (or L2 filter). This filter plays an important role in calibrating the business cycle.

⁸ We notice that the second term is the discrete derivative of the trend x̂_t, which characterises the smoothness of the curve.
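Since the L2 filter has a closed-form solution, it can be coded in a few lines. The following Python sketch solves the linear system directly with dense matrices; for long series a sparse formulation would be preferable. Names are illustrative.

```python
import numpy as np

def hp_filter(y, lam):
    """L2 (Hodrick-Prescott) trend filter: x = (I + 2*lam*D'D)^{-1} y,
    where D is the (n-2) x n second-difference operator."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    D = np.zeros((n - 2, n))
    for t in range(n - 2):
        D[t, t:t + 3] = [1.0, -2.0, 1.0]
    A = np.eye(n) + 2.0 * lam * D.T @ D
    return np.linalg.solve(A, y)
```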
Kalman filtering. Another important trend estimation technique is the Kalman filter, which is described in Appendix A.1. In this case, the trend μ_t is a hidden process which follows a given dynamic. For example, we may assume that the model is⁹:

R_t = μ_t + σ_ε ε_t
μ_t = μ_{t−1} + σ_η η_t    (6)

Here, the equation of R_t is the measurement equation and R_t is the observable signal of realised returns. The hidden process μ_t is supposed to follow a random walk. We define μ̂_{t|t−1} = E_{t−1}[μ_t] and P_{t|t−1} = E_{t−1}[(μ̂_{t|t−1} − μ_t)²]. Using the results given in Appendix A.1, we have:

μ̂_{t+1|t} = (1 − K_t) μ̂_{t|t−1} + K_t R_t

where K_t = P_{t|t−1} / (P_{t|t−1} + σ_ε²) is the Kalman gain. The estimation error is determined by Riccati's equation:

P_{t+1|t} = P_{t|t−1} + σ_η² − P_{t|t−1} K_t

Riccati's equation gives us the stationary solution:

P_∞ = (σ_η/2) (σ_η + √(σ_η² + 4σ_ε²))

The filter equation becomes:

μ̂_{t+1|t} = (1 − κ) μ̂_{t|t−1} + κ R_t

with:

κ = 2σ_η / (σ_η + √(σ_η² + 4σ_ε²))

This Kalman filter can be considered as an exponential moving average filter with parameter¹⁰ λ = −ln(1 − κ):

μ̂_t = (1 − e^{−λ}) Σ_{i=0}^{∞} e^{−λi} R_{t−i}

with¹¹ μ̂_t = E_t[μ_t]. The filter of the trend x̂_t is therefore determined by the following equation:

x̂_t = (1 − e^{−λ}) Σ_{i=0}^{∞} e^{−λi} y_{t−i}

while the derivative of the trend may be directly related to the observed signal y_t as follows:

μ̂_t = (1 − e^{−λ}) y_t − (1 − e^{−λ})(e^{λ} − 1) Σ_{i=1}^{∞} e^{−λi} y_{t−i}

In Figure 5, we report the window function of the Kalman filter for several values of λ. We notice that the cumulative weightings increase strongly with λ. The half-life of this filter is approximately equal to |(λ^{−1} − 1/2) ln 2|. For example, the half-life for λ = 5% is 14 days.

Figure 5: Window function ℒ_i of the Kalman filter

⁹ Equation (5) is a special case of this model if σ_η = 0.
¹⁰ We have 0 < κ < 1 and λ > 0.
¹¹ We notice that μ̂_{t+1|t} = μ̂_t.
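The steady-state form of this filter is a one-line recursion on returns. The following Python sketch implements it together with the gain and half-life formulas given above; it is a minimal illustration, assuming daily returns in a NumPy array and a zero initial slope estimate.

```python
import numpy as np

def ewma_trend(returns, sigma_eps, sigma_eta):
    """Steady-state Kalman filter of model (6), i.e. an exponential moving
    average of returns with gain kappa = 2*s_eta/(s_eta + sqrt(s_eta^2 + 4*s_eps^2))."""
    kappa = 2.0 * sigma_eta / (sigma_eta + np.sqrt(sigma_eta**2 + 4.0 * sigma_eps**2))
    lam = -np.log(1.0 - kappa)
    half_life = abs((1.0 / lam - 0.5) * np.log(2.0))
    mu = np.zeros(len(returns))
    for t, r in enumerate(returns):
        prev = mu[t - 1] if t > 0 else 0.0
        mu[t] = (1.0 - kappa) * prev + kappa * r   # mu_{t+1|t}
    return mu, half_life
```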
We may wonder what the link is between the regression model (5) and the Markov model (6). Equation (5) is equivalent to the following state space model¹²:

y_t = x_t + σ_ε ε_t
x_t = x_{t−1} + μ

If we now consider that the trend is stochastic, the model becomes:

y_t = x_t + σ_ε ε_t
x_t = x_{t−1} + μ + σ_η η_t

This model is called the local level model. We may also assume that the slope of the trend is stochastic, in which case we obtain the local linear trend model:

y_t = x_t + σ_ε ε_t
x_t = x_{t−1} + μ_{t−1} + σ_η η_t
μ_t = μ_{t−1} + σ_ζ ζ_t

These three models are special cases of structural models (Harvey, 1989) and may be easily solved by Kalman filtering. We also deduce that the Markov model (6) is a special case of the latter when σ_ε = 0.

Remark 5 We have shown that Kalman filtering may be viewed as an exponential moving average filter when we consider the Markov model (6). Nevertheless, we cannot regard the Kalman filter simply as a moving average filter. First, the Kalman filter is the optimal filter in the case of the linear Gaussian model described in Appendix A.1. Second, it could be regarded as an efficient computational solution of the least squares method (Sorensen, 1970). Third, we could use it to solve more sophisticated processes than the Markov model (6). However, some nonlinear or non-Gaussian models may be too complex for Kalman filtering. These nonlinear models can be solved by particle filters or sequential Monte Carlo methods (see Doucet et al., 1998).

Another important feature of the Kalman approach is the derivation of an optimal smoother (see Appendix A.1). At time t, we are interested in the numerical value of x̂_t, but also in the past values of x̂_{t−i}, because we would like to measure the slope of the trend. The Kalman smoother improves the estimate of x̂_{t−i} by using all the information between t − i and t. Let us consider the previous example in relation to the S&P 500 index, using the local level model. Figure 6 gives the filtered and smoothed components x̂_t and μ̂_t for two sets of parameters¹³. We verify that the Kalman smoother reduces the noise by incorporating more information. We also notice that the restriction σ_ζ = 0 increases the variance of the trend and slope estimators.

2.3 Nonlinear filtering

In this section, we review other filtering approaches. They are generally classed as nonlinear filters, because it is not possible to express the trend as a linear convolution of the signal and a window function.

¹² In what follows, the noise processes are white noise: ε_t ∼ N(0, 1), η_t ∼ N(0, 1) and ζ_t ∼ N(0, 1).
¹³ For the first set of parameters, we assume that σ_ε = 100 σ_ζ and σ_η = σ_ε/100. For the second set of parameters, we impose the restriction σ_ζ = 0.
Figure 6: Kalman filtered and smoothed components
2.3.1 Nonparametric regression

In the regression model (5), we assume that x_t = f(t) with f(t) = μt. The model is said to be parametric because the estimation of the trend consists of estimating the parameter μ. We then have x̂_t = μ̂ t. With nonparametric regression, we directly estimate the function f, obtaining x̂_t = f̂(t). Some examples of nonparametric regression are kernel regression, loess regression and spline regression. A popular method for trend filtering is local polynomial regression:

y_t = f(t) + ε_t = β_0(τ) + Σ_{j=1}^{p} β_j(τ) (τ − t)^j + ε_t

For a given value of τ, we estimate the parameters β̂_j(τ) using weighted least squares with the following weightings:

w_t = K((τ − t)/h)

where K is the kernel function with a bandwidth h. We deduce that:

x̂_t = E[y_t | τ = t] = β̂_0(t)

Cleveland (1979) proposed an improvement to the kernel regression through a two-stage procedure (loess regression). First, we fit a polynomial regression to estimate the residuals ε̂_t. Then, we compute δ_t = (1 − u_t²)² 1{|u_t| ≤ 1} with u_t = ε̂_t / (6 median(|ε̂|)) and run a second kernel regression¹⁴ with weightings δ_t w_t.

¹⁴ Cleveland (1979) suggests using the tricube kernel function to define K.
A spline function is a C² function S(τ) which corresponds to a cubic polynomial function on each interval [t, t + 1). Let SP be the set of spline functions. We then have to solve the following optimisation programme:

min_{S ∈ SP} (1 − h) Σ_{t=0}^{n} w_t (y_t − S(t))² + h ∫_0^T w_τ S″(τ)² dτ

where h is the smoothing parameter: h = 0 corresponds to the interpolation case¹⁵ and h = 1 corresponds to the linear regression¹⁶.

Figure 7: Illustration of the kernel, loess and spline filters

We illustrate these three nonparametric methods in Figure 7. The calibration of these filters is more complicated than for moving average filters, where the only parameter is the length n of the window. With these methods, we have to decide the polynomial degree¹⁷ p, the kernel function¹⁸ K and the smoothing parameter¹⁹ h.

¹⁵ We have x̂_t = S(t) = y_t.
¹⁶ We have x̂_t = S(t) = ĉ + β̂ t with (ĉ, β̂) the OLS estimate of y_t on a constant and time t, because the optimum is reached for S″(τ) = 0.
¹⁷ For the kernel regression, we use a Gaussian kernel with a bandwidth h = 0.10. We notice the impact of the degree of polynomial. The higher the degree, the smoother the trend (and the slope of the trend).
¹⁸ For the loess regression, the degree of polynomial is set to 1 and the bandwidth h is 0.02. We show the impact of the second step which modifies the kernel function.
¹⁹ For the spline regression, we consider a uniform kernel function. We notice that the parameter h has an impact on the smoothness of the trend.
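As a sketch of the local polynomial approach, the following Python function fits a weighted polynomial around each date with a Gaussian kernel and keeps the intercept as the trend estimate. The bandwidth convention (a fraction of the sample size) and the default degree are illustrative choices, not those of the original study.

```python
import numpy as np

def local_poly_trend(y, h=0.10, p=3):
    """Local polynomial regression of the signal on time with a Gaussian
    kernel of bandwidth h (fraction of the sample size) and degree p.
    Returns the fitted trend beta_0(t)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    t = np.arange(n, dtype=float)
    x = np.empty(n)
    for tau in range(n):
        u = (t - tau) / (h * n)                      # scaled distances to tau
        w = np.exp(-0.5 * u**2)                      # Gaussian kernel weights
        X = np.vander(u, p + 1, increasing=True)     # 1, u, u^2, ..., u^p
        XtW = X.T * w                                # weighted design matrix
        beta = np.linalg.solve(XtW @ X, XtW @ y)
        x[tau] = beta[0]                             # intercept = trend at tau
    return x
```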
2.3.2 L1 filtering

The idea of the Hodrick-Prescott filter can be generalised to a larger class of filters by using the L_p penalty condition instead of the L_2 penalty. This generalisation was previously
discussed in the work of Daubechies et al. (2004) in relation to the linear inverse problem, while Tibshirani (1996) considers the Lasso regression problem. If we consider an L1 filter, the objective function becomes:

(1/2) Σ_{t=1}^{n} (y_t − x̂_t)² + λ Σ_{t=2}^{n−1} |x̂_{t−1} − 2 x̂_t + x̂_{t+1}|

which is equivalent to the following vectorial form:

(1/2) ‖y − x̂‖₂² + λ ‖D x̂‖₁

Kim et al. (2009) show that the dual problem of this L1 filter scheme is a quadratic programme with some boundary constraints²⁰. To find x̂, we may also use the quadratic programming algorithm, but Kim et al. (2009) suggest using the primal-dual interior point method in order to optimise the numerical computation speed.

We have illustrated the L1 filter in Figure 8. Contrary to all other previous methods, the filtered signal comprises a set of straight trends and breaks²¹, because the L1 norm imposes the condition that the second derivative of the filtered signal must be zero. The competition between the two terms in the objective function turns into a competition between the number of straight trends (or the number of breaks) and the closeness to the data. Thus, the smoothing parameter λ plays an important role in detecting the number of breaks. This explains why L1 filtering is radically different from L2 (or Hodrick-Prescott) filtering. Moreover, it is easy to compute the slope of the trend μ̂_t for the L1 filter. It is a step function, indicating clearly if the trend is up or down, and when it changes (see Figure 8).

²⁰ The detail of this derivation is shown in Appendix A.2.
²¹ A break is the position where the signal trend changes.
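For readers who simply want to experiment with the L1 objective, a generic convex solver is sufficient. The sketch below assumes the cvxpy package is available; it is not the primal-dual interior point implementation discussed above, only an illustration of the optimisation problem.

```python
import numpy as np
import cvxpy as cp   # assumes the cvxpy package is installed

def l1_trend_filter(y, lam):
    """L1 trend filtering: minimise 0.5*||y - x||_2^2 + lam*||D x||_1,
    where D is the second-difference operator. The solution is piecewise linear."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    D = np.zeros((n - 2, n))
    for t in range(n - 2):
        D[t, t:t + 3] = [1.0, -2.0, 1.0]
    x = cp.Variable(n)
    objective = cp.Minimize(0.5 * cp.sum_squares(y - x) + lam * cp.norm1(D @ x))
    cp.Problem(objective).solve()
    return x.value
```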
2.3.3 Wavelet filtering

Another way to estimate the trend x_t is to denoise the signal y_t by using spectral analysis. The Fourier transform is an alternative representation of the original signal y_t, which becomes a frequency function:

ŷ(ω) = Σ_{t=1}^{n} y_t e^{−iωt}

We note ŷ(ω) = F(y). By construction, we have y = F^{−1}(ŷ) with F^{−1} the inverse Fourier transform. A simple idea for denoising in spectral analysis is to set some coefficients ŷ(ω) to zero before reconstructing the signal. Figure 9 is an illustration of denoising using the thresholding rule. Selected parts of the frequency spectrum can easily be manipulated by filtering tools. For example, some can be attenuated, and others may be completely removed. Applying the inverse Fourier transform to this filtered spectrum leads to a filtered time series. Therefore, a smoothed signal can easily be obtained by applying a low-pass filter, that is, by removing the higher frequencies. For example, we have represented two denoised signals of the S&P 500 index in Figure 9. For the first one, we use a 95% thresholding procedure whereas 99% of the Fourier coefficients are set to zero in the second case. One difficulty with this approach is the bad time location for low frequency signals and the bad frequency location for the high frequency signals. It is then difficult to localise when the trend (which is located in low frequencies) reverses. But the main drawback of spectral analysis is that it is not well suited to nonstationary processes (Martin and Flandrin, 1985, Fuentes, 2002, Oppenheim and Schafer, 2009).
Figure 8: L1 versus L2 filtering

Figure 9: Spectral filtering
A solution consists of adopting a double dimension analysis, both in time and frequency. This approach corresponds to wavelet analysis. The method of denoising is the same as described previously and the estimation of x_t is done in three steps:

1. we compute the wavelet transform W of the original signal y_t to obtain the wavelet coefficients ω = W(y);

2. we modify the wavelet coefficients according to a denoising rule D: ω* = D(ω);

3. we convert the modified wavelet coefficients into a new signal using the inverse wavelet transform W^{−1}: x̂ = W^{−1}(ω*).

There are two principal choices in this approach. First, we have to specify which mother wavelet to use. Second, we have to define the denoising rule. Let λ^{−} and λ^{+} be two scalars with 0 < λ^{−} < λ^{+}. Donoho and Johnstone (1995) define several shrinkage methods²²:

• Hard shrinkage:
ω*_i = ω_i · 1{|ω_i| > λ^{+}}

• Soft shrinkage:
ω*_i = sgn(ω_i) (|ω_i| − λ^{+})^{+}

• Semi-soft shrinkage:
ω*_i = 0 if |ω_i| ≤ λ^{−} ; sgn(ω_i) (λ^{+} − λ^{−})^{−1} λ^{+} (|ω_i| − λ^{−}) if λ^{−} < |ω_i| ≤ λ^{+} ; ω_i if |ω_i| > λ^{+}

• Quantile shrinkage is a hard shrinkage method where λ^{+} is the q-th quantile of the coefficients |ω_i|.

Wavelet filtering is illustrated in Figure 10. We have computed the wavelet coefficients using the cascade algorithm of Mallat (1989) and the low-pass and high-pass filters of order 6 proposed by Daubechies (1992). The filtered trend is obtained using quantile shrinkage. In the first case, the noisy signal remains because we consider all the coefficients (q = 0). In the second and third cases, 95% and 99% of the wavelet coefficients are set to zero²³.

²² In practice, the coefficients ω_i are standardised before the thresholding rule is applied.
²³ It is interesting to note that the denoising procedure retains some wavelet coefficients corresponding to high and medium frequencies and located around the 2008 crisis.
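The three-step procedure above can be sketched with an off-the-shelf wavelet library. The code below assumes the PyWavelets package and a Daubechies wavelet of order 6 ('db6'); it applies quantile (hard) shrinkage to the detail coefficients, rather than re-implementing Mallat's cascade algorithm.

```python
import numpy as np
import pywt   # assumes the PyWavelets package is installed

def wavelet_trend(y, q=0.95, wavelet="db6"):
    """Wavelet denoising with quantile (hard) shrinkage: keep only the
    detail coefficients whose absolute value exceeds the q-quantile."""
    y = np.asarray(y, dtype=float)
    coeffs = pywt.wavedec(y, wavelet)                     # step 1: transform
    details = np.concatenate([np.abs(c) for c in coeffs[1:]])
    threshold = np.quantile(details, q)
    shrunk = [coeffs[0]] + [c * (np.abs(c) > threshold) for c in coeffs[1:]]  # step 2
    x = pywt.waverec(shrunk, wavelet)                     # step 3: reconstruct
    return x[:len(y)]   # waverec may return one extra sample for odd lengths
```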
2.3.4 Other methods

Many other methods can be used to perform trend filtering. The most recent include, for example, singular spectrum analysis²⁴ (Vautard et al., 1992), support vector machines²⁵ and empirical mode decomposition (Flandrin et al., 2004). Moreover, we notice that traders sometimes use their own techniques (see, inter alia, Ehlers, 2001).

²⁴ See Appendix A.5 for an illustration.
²⁵ A brief presentation is given in Appendix A.4.
Figure 10: Wavelet filtering
2.4 Multivariate filtering

Until now, we have assumed that the trend is specific to a financial asset. However, we may be interested in estimating the common trend of several financial assets. For example, if we wanted to estimate the trend of emerging markets equities, we could use a global index like the MSCI EM or extract the trend by considering several indices, e.g. the Bovespa index (Brazil), the RTS index (Russia), the Nifty index (India), the HSCEI index (China), etc. In this case, the trend-cycle model becomes:

(y_t^{(1)}, . . . , y_t^{(m)})ᵀ = x_t + (ε_t^{(1)}, . . . , ε_t^{(m)})ᵀ

where y_t^{(j)} and ε_t^{(j)} are respectively the signal and the noise of the financial asset j and x_t is the common trend. One idea for estimating the common trend is to take the mean of the specific trends:

x̂_t = (1/m) Σ_{j=1}^{m} x̂_t^{(j)}
If we consider moving average filtering, it is equivalent to applying the filter to the average signal²⁶ ȳ_t = (1/m) Σ_{j=1}^{m} y_t^{(j)}. This rule is also valid for some nonlinear filters such as L1 filtering (see Appendix A.2). In what follows, we consider the two main alternative approaches developed in econometrics to estimate a (stochastic) common trend.

2.4.1 Error-correction model, common factors and the P-T decomposition

The econometrics of nonstationary time series may also help us to estimate a common trend. y_t^{(j)} is said to be integrated of order 1 if the change y_t^{(j)} − y_{t−1}^{(j)} is stationary. We will note y_t^{(j)} ∼ I(1) and (1 − L) y_t^{(j)} ∼ I(0). Let us now define y_t = (y_t^{(1)}, . . . , y_t^{(m)}). The vector y_t is cointegrated of rank r if there exists a matrix β of rank r such that z_t = βᵀ y_t ∼ I(0). In this case, we show that y_t may be specified by an error-correction model (Engle and Granger, 1987):

Δy_t = γ z_{t−1} + Σ_i Φ_i Δy_{t−i} + u_t    (7)

where u_t is an I(0) vector process. Stock and Watson (1988) propose another interesting representation of cointegration systems. Let f_t be a vector of r common factors which are I(1). Therefore, we have:

y_t = A f_t + ε_t    (8)

where ε_t is an I(0) vector process and f_t is an I(1) vector process. One of the difficulties with this type of model is the identification step (Peña and Box, 1987). Gonzalo and Granger (1995) suggest defining a permanent-transitory (P-T) decomposition:

y_t = P_t + T_t

such that the permanent component P_t is difference stationary, the transitory component T_t is covariance stationary and (P_t, T_t) satisfies a constrained autoregressive representation. Using this framework and some other conditions, Gonzalo and Granger show that we may obtain the representation (8) by estimating the relationship (7):

f_t = γ_⊥ᵀ y_t    (9)

where γ_⊥ᵀ γ = 0. They then follow the works of Johansen (1988, 1991) to derive the maximum likelihood estimator of γ_⊥. Once we have estimated the relationship (9), it is also easy to identify the common trend²⁷ x_t.

²⁶ We have:
x̂_t = (1/m) Σ_{j=1}^{m} Σ_{i=0}^{n−1} ℒ_i y_{t−i}^{(j)} = Σ_{i=0}^{n−1} ℒ_i ( (1/m) Σ_{j=1}^{m} y_{t−i}^{(j)} ) = Σ_{i=0}^{n−1} ℒ_i ȳ_{t−i}
²⁷ If a common trend exists, it is necessarily one of the common factors.
2.4.2 Common stochastic trend model

Another idea is to consider an extension of the local linear trend model:

y_t = β x_t + ε_t
x_t = x_{t−1} + μ_{t−1} + σ_η η_t
μ_t = μ_{t−1} + σ_ζ ζ_t

with y_t = (y_t^{(1)}, . . . , y_t^{(m)}), ε_t = (ε_t^{(1)}, . . . , ε_t^{(m)}) ∼ N(0, Σ), η_t ∼ N(0, 1) and ζ_t ∼ N(0, 1). Moreover, we assume that ε_t, η_t and ζ_t are independent of each other. Given the parameters of the model, we may run the Kalman filter to estimate the trend x̂_t and the slope μ̂_t, whereas the Kalman smoother allows us to estimate x̂_{t−i} and μ̂_{t−i} at time t.

Remark 6 The case σ_ζ = 0 has been extensively studied by Chang et al. (2009). In particular, they show that y_t is cointegrated, the cointegration space being determined by the loading vector β, and that the P-T decomposition identifies the common stochastic trend with a particular linear combination of y_t, implying that the above averaging rule is not optimal.

We come back to the example given in Figure 6. Using the second set of parameters, we now consider three stock indices: the S&P 500 index, the Stoxx 600 index and the MSCI EM index. For each index, we estimate the filtered trend. Moreover, using the previous common stochastic trend model²⁸, we estimate the common trend for the bivariate signal (S&P 500, Stoxx 600) and the trivariate signal (S&P 500, Stoxx 600, MSCI EM).

Figure 11: Multivariate Kalman filtering

²⁸ We assume that β_j takes the value 1 for the three signals.
3 Trend filtering in practice

3.1 The calibration problem

For the practical use of the trend extraction techniques discussed above, the calibration of filtering parameters is crucial. These calibrated parameters must incorporate our prediction requirement, or they can be mapped to a commonly-known benchmark estimator. These constraints offer us some criteria for determining the optimal parameters for our expected prediction horizon. Below, we consider two possible calibration schemes based on these criteria.

3.1.1 Calibration based on prediction error

One idea for estimating the parameters of a model is to use statistical inference tools. Let us consider the local linear trend model. We may estimate the set of parameters (σ_ε, σ_η, σ_ζ) by maximising the log-likelihood function²⁹:

ℓ = −(1/2) Σ_{t=1}^{n} [ ln 2π + ln F_t + v_t²/F_t ]

where v_t = y_t − E_{t−1}[y_t] is the innovation process and F_t = E_{t−1}[v_t²] is the variance of v_t.

²⁹ Another way of estimating the parameters is to consider the log-likelihood function in the frequency domain (Roncalli, 2010). In the case of the local linear trend model, the stationary form of y_t is S(y_t) = (1 − L)² y_t. We deduce that the associated log-likelihood function is:

ℓ = −(n/2) ln 2π − (1/2) Σ_{j=0}^{n−1} ln f(ω_j) − (1/2) Σ_{j=0}^{n−1} I(ω_j)/f(ω_j)

where I(ω_j) is the periodogram of S(y_t) and f(ω) is the spectral density:

f(ω) = σ_ζ² + 2(1 − cos ω) σ_η² + 4(1 − cos ω)² σ_ε²

because we have:

S(y_t) = σ_ζ ζ_{t−1} + σ_η (1 − L) η_t + σ_ε (1 − L)² ε_t

In Figure 12, we have reported the filtered and smoothed trend and slope estimated by the maximum likelihood method. We notice that the estimated components are more noisy than those obtained in Figure 6. We can explain this easily because maximum likelihood is based on the one-day innovation process. If we want to look at a longer trend, we have to consider the innovation process v_t = y_t − E_{t−h}[y_t] where h is the time horizon. We have reported the slope for h = 50 days in Figure 12. It is very different from the slope corresponding to h = 1 day.

The problem is that the computation of the log-likelihood for the innovation process v_t = y_t − E_{t−h}[y_t] is trickier because there is generally no analytic expression. This is why we do not recommend this technology for trend filtering problems, because the trends estimated are generally very short-term. A better solution is to employ a cross-validation procedure to calibrate the parameters of the filters discussed above. Let us consider the calibration scheme presented in Figure 13. We divide our historical data into a training set and a validation set, which are characterised by two time parameters T_1 and T_2. The size of the training set T_1 controls the precision of our calibration, for a fixed parameter λ. For this training set, the value of the expectation E_{t−h}[y_t] is computed. The second parameter
Figure 12: Maximum likelihood of the trend and slope components

T_2 determines the size of the validation set, which is used to estimate the prediction error:

e(λ; h) = Σ_{t=1}^{n−h} (y_t − E_{t−h}[y_t])²

This quantity is directly related to the prediction horizon h = T_2 for a given investment strategy. The minimisation of the prediction error leads to the optimal value λ* of the filter parameters which will be used to predict the trend for the test set. For example, we apply this calibration scheme to L1 filtering for h equal to 50 days. Figure 14 illustrates the calibration procedure for the S&P 500 index with T_1 = 400 and T_2 = 50. Minimising the cumulative prediction error over the validation set gives the optimal value λ* = 7.03.
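A minimal sketch of this cross-validation loop is given below in Python. It assumes that the forecast E_{t−h}[y_t] is obtained by extrapolating the last filtered trend estimated on the training window, which is one simple choice among several; the function names and the grid of candidate values are illustrative.

```python
import numpy as np

def calibrate_lambda(y, filter_func, grid, T1=400, T2=50):
    """Cross-validation sketch: for each candidate lambda, fit the filter on a
    training window of length T1 and measure the squared prediction error over
    the following T2 observations."""
    y = np.asarray(y, dtype=float)
    train, valid = y[:T1], y[T1:T1 + T2]
    errors = []
    for lam in grid:
        x = filter_func(train, lam)              # e.g. l1_trend_filter
        slope = x[-1] - x[-2]                    # last estimated slope
        forecast = x[-1] + slope * np.arange(1, T2 + 1)
        errors.append(np.sum((valid - forecast) ** 2))
    return grid[int(np.argmin(errors))]
```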
Figure 13: Cross-validation procedure for determining optimal parameters (the historical data are split into a training set of length T_1 and a test set of length T_2; the forecasting period after today also has length T_2)
3.1.2 Calibration based on benchmark estimator

The trend filtering algorithm can be calibrated with a benchmark estimator. In order to illustrate this idea, we present in this discussion the calibration procedure for L2 filters by
Figure 14: Calibration procedure with the S&P 500 index for the L1 filter
using spectral analysis. Though the L2 filter provides an explicit solution, which is a great advantage for numerical implementation, the calibration of the smoothing parameter λ is not straightforward. We propose to calibrate the L2 filter by comparing the spectral density of this filter with that obtained using the uniform moving average filter with horizon n, for which the spectral density is:

f_MA(ω) = (1/n²) | Σ_{t=0}^{n−1} e^{−iωt} |²

For the L2 filter, the solution has the analytical form x̂ = (I + 2λ DᵀD)^{−1} y. Therefore, the spectral density can also be computed explicitly:

f_HP(ω) = [ 1 / (1 + 4λ (3 − 4 cos ω + cos 2ω)) ]²

This spectral density can then be approximated by [1 / (1 + 2λω⁴)]². Hence, the spectral width is (2λ)^{−1/4} for the L2 filter whereas it is 2π n^{−1} for the uniform moving average filter. The calibration of the L2 filter could be achieved by matching these two quantities. Finally, we obtain the following relationship:

λ = (1/2) (n/(2π))⁴
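For instance, applying this relationship to a three-month moving average (n = 65 days) gives λ = ½ (65/2π)⁴ ≈ 5.7 × 10³, while a one-year moving average (n = 260 days) gives λ ≈ 1.5 × 10⁶.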
In Figure 15, we represent the spectral density of the uniform moving average filter for different window sizes n. We also report the spectral density of the corresponding L2 filters. To obtain this, we calibrated the optimal parameter λ* by least squares minimisation.
In Figure 16, we compare the optimal estimator λ* with the value 10.27 λ̃ obtained from the previous relationship³⁰. We notice that the approximation is very good.

³⁰ We estimated the figure 10.27 using least squares.

Figure 15: Spectral density of moving average and L2 filters
3.2 What about the variance of the estimator?

Let μ̂_t be the estimator of the slope of the trend. There may be some confusion between the estimator of the slope and the estimated value of the slope (or the estimate). The estimator is a random variable and is defined by a probability distribution function. Based on the sample data, the estimator takes a value which is the estimate of the slope. Suppose that we obtain an estimate of 10%. It means that 10% is the most likely value of the slope given the data. But it does not mean that 10% is the true value of the slope.

3.2.1 Measuring the efficiency of trend filters

Let μ_t⁰ be the true value of the slope. In statistical inference, the quality of an estimator is defined by the mean squared error (or MSE):

MSE(μ̂_t) = E[(μ̂_t − μ_t⁰)²]

It indicates how far the estimates are from the true value. We say that the estimator μ̂_t^{(1)} is more efficient than the estimator μ̂_t^{(2)} if its MSE is lower:

μ̂_t^{(1)} ≻ μ̂_t^{(2)} ⟺ MSE(μ̂_t^{(1)}) ≤ MSE(μ̂_t^{(2)})
Figure 16: Relationship between the value of λ and the length of the moving average filter

We may decompose the MSE statistic into two components:

MSE(μ̂_t) = E[(μ̂_t − E[μ̂_t])²] + E[(E[μ̂_t] − μ_t⁰)²]

The first component is the variance of the estimator var(μ̂_t) whereas the second component is the square of the bias B(μ̂_t). Generally, we are interested in estimators that are unbiased (B(μ̂_t) = 0). If this is the case, comparing two estimators is equivalent to comparing their variances.

Let us assume that the price process is a geometric Brownian motion:

dS_t = μ_0 S_t dt + σ_0 S_t dW_t

In this case, the slope of the trend is constant and is equal to μ_0. In Figure 17, we have reported the probability density function of the estimator μ̂_t when the true slope μ_0 is 10%. We consider the estimator based on a uniform moving average filter of length n. First, we notice that using filters is better than using the noisy signal. We also observe that the variance of the estimators increases with the parameter σ_0 and decreases with the length n.

3.2.2 Trend detection versus trend filtering

In the previous paragraph, we saw that an estimate of the trend may not be significant if the variance of the estimator is too large. Before computing an estimate of the trend, we then have to decide whether there is a trend or not. This process is called trend detection. Mann (1945) considers the following statistic:

S_t^{(n)} = Σ_{i=0}^{n−2} Σ_{j=i+1}^{n−1} sgn(y_{t−i} − y_{t−j})
Figure 17: Density of the estimator μ̂_t

Figure 18: Impact of σ_0 on the estimator μ̂_t
with sgn(y_{t−i} − y_{t−j}) = +1 if y_{t−i} > y_{t−j} and sgn(y_{t−i} − y_{t−j}) = −1 if y_{t−i} < y_{t−j}. We have³¹:

var(S_t^{(n)}) = n(n − 1)(2n + 5) / 18

We can show that:

−n(n + 1)/2 ≤ S_t^{(n)} ≤ n(n + 1)/2

The bounds are reached if y_t < y_{t−i} (negative trend) or y_t > y_{t−i} (positive trend) for all i ∈ N*. We can then normalise the score:

S̄_t^{(n)} = 2 S_t^{(n)} / (n(n + 1))

S̄_t^{(n)} takes the value +1 (or −1) if we have a perfect positive (or negative) trend. If there is no trend, it is obvious that S_t^{(n)} ≃ 0. Under this null hypothesis, we have:

Z_t^{(n)} → N(0, 1) as n → ∞

with:

Z_t^{(n)} = S_t^{(n)} / √var(S_t^{(n)})

In Figure 19, we report the normalised score S̄_t^{(n)} for the S&P 500 index and different values of n. Statistics relating to the null hypothesis are given in Table 2 for the study period. We notice that we generally reject the hypothesis that there is no trend when we consider a period of one year. The number of cases where no trend is detected increases if we consider a shorter period. For example, if n is equal to 10 days, we accept the hypothesis that there is no trend in 42% of cases when the confidence level is set to 90%.

Table 2: Frequencies of rejecting the null hypothesis with confidence level

90% 95% 99%
n = 10 days 58.06% 49.47% 29.37%
n = 3 months 85.77% 82.87% 76.68%
n = 1 year 97.17% 96.78% 95.33%
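The Mann statistic and its normalised score are straightforward to compute. The following Python sketch follows the conventions used above (the normalisation by n(n + 1)/2 and the variance formula of the text); it is an illustrative helper for a single window of observations, oldest first.

```python
import numpy as np

def mann_statistic(y):
    """Mann (1945) trend-detection statistic on the window y, its normalised
    score and the z-statistic used to test the null hypothesis of no trend."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    S = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            S += np.sign(y[j] - y[i])      # +1 for an up pair, -1 for a down pair
    var_S = n * (n - 1) * (2 * n + 5) / 18.0
    score = 2.0 * S / (n * (n + 1))        # normalised score
    z = S / np.sqrt(var_S)                 # approximately N(0,1) under "no trend"
    return S, score, z
```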
Remark 7 We have reported the statistic \bar{S}_t^{(10)} against the trend estimate^{32} \hat{\mu}_t for the S&P 500 index since January 2000. We notice that \hat{\mu}_t may be positive whereas \bar{S}_t^{(10)} is negative. This illustrates that a trend measurement is just an estimate. It does not mean that a trend exists.
31 If there are some tied sequences (y_{t-i} = y_{t-i-1}), the formula becomes:

\operatorname{var}\left(S_t^{(n)}\right) = \frac{1}{18}\left[n(n-1)(2n+5) - \sum_{k=1}^{g} n_k (n_k - 1)(2 n_k + 5)\right]

with g the number of tied sequences and n_k the number of data points in the k-th tied sequence.

32 It is computed with a uniform moving average of 10 days.
Figure 19: Trend detection for the S&P 500 index
Figure 20: Trend detection versus trend filtering
3.3 From trend filtering to trend forecasting

There are two possible applications for the trend following problem. First, trend filtering can be used to analyse the past. A noisy signal can be transformed into a smoother signal, which can be interpreted more easily. An ex-post analysis of this kind can, for instance, clearly separate increasing price periods from decreasing price periods. This analysis can be performed on any time series, even on a random walk. For example, we have reported four simulations of a geometric Brownian motion without drift and an annual volatility of 20% in Figure 21. In this context, trend filtering could help us to estimate the different trends in the past.
Figure 21: Four simulations of a geometric Brownian motion without drift
On the other hand, trend analysis may be used as a predictive tool. Prediction is a
much more ambitious objective than analysing the past. It cannot be performed on any
time series. For instance, trend following predictions suppose that the last observed trend
influences future returns. More precisely, these predictors suppose that positive (or negative)
trends are more likely to be followed by positive (or negative) returns. Such an assumption
has to be tested empirically. For example, it is obvious that the time series in Figure 21
exhibit certain trends, whereas we know that there is no trend in a geometric Brownian
motion without drift. Thus, we may still observe some trends in an ex-post analysis. It does
not mean, however, that trends will persist in the future.
The persistence of trends is tested here in a simple framework for major financial indices^{33}. For each of these indices, the one-month returns are separated into two sets. The first set includes one-month returns that immediately follow a positive three-month return (this is negative for the second set). The average one-month return is computed for
each of these two sets, and the results are given in Table 3.

33 The study period begins in January 1995 (January 1999 for the MSCI EM) and ends in October 2011.

Figure 22: Distribution of the conditional standardised monthly return

These results clearly show that, on average, higher returns can be expected after a positive three-month return than after a negative three-month period. Therefore, observation of the current trend may have a
predictive value for the indices under consideration. Moreover, we consider the distribution
of the one-month returns, based on past three-month returns. Figure 22 illustrates the case
of the GSCI index. In the first quadrant, the one-month returns are divided into two sets, depending on whether the previous three-month return is positive or negative. The cumulative distributions of these two sets are shown. In the second quadrant, we consider, on the one hand, the distribution of one-month returns following a three-month return below -5% and, on the other hand, the distribution of returns following a three-month return exceeding +5%. The same procedure is repeated in the other quadrants, for a ±10% and a ±15% threshold. This simple test illustrates the usefulness of trend following strategies. Here,
trends seem persistent enough to study such strategies. Of course, on other time scales or
for other assets, one may obtain opposite results that would support contrarian strategies.
Table 3: Average one-month conditional return based on past trends

Trend            Positive   Negative   Difference
Eurostoxx 50       1.1%       0.2%        0.9%
S&P 500            0.9%       0.5%        0.4%
MSCI WORLD         0.6%      -0.3%        1.0%
MSCI EM            1.9%      -0.3%        2.2%
TOPIX              0.4%      -0.4%        0.9%
EUR/USD            0.2%      -0.2%        0.4%
USD/JPY            0.2%      -0.2%        0.4%
GSCI               1.3%      -0.4%        1.6%
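The conditional-return computation behind Table 3 can be sketched as follows; the resampling frequency, the data source and the column name are assumptions, since the paper does not describe its exact data handling.

```python
import numpy as np
import pandas as pd

def conditional_monthly_returns(prices: pd.Series) -> pd.Series:
    """Average one-month return conditional on the sign of the
    immediately preceding three-month return (sketch of Table 3)."""
    monthly = prices.resample("M").last()
    r1m = monthly.pct_change()                # one-month returns
    r3m = monthly.pct_change(3).shift(1)      # previous three-month return
    valid = r1m.notna() & r3m.notna()
    sign = np.where(r3m[valid] > 0, "Positive", "Negative")
    out = r1m[valid].groupby(sign).mean()
    out["Difference"] = out.get("Positive", np.nan) - out.get("Negative", np.nan)
    return out

# usage with a hypothetical daily price series indexed by dates:
# prices = pd.read_csv("spx.csv", index_col=0, parse_dates=True)["close"]
# print(conditional_monthly_returns(prices))
```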
4 Conclusion

The ultimate goal of trend filtering in finance is to design portfolio strategies that may benefit from these trends. But the path between trend measurement and portfolio allocation is not straightforward. It involves studies and explanations that would not fit in this paper. Nevertheless, let us point out some major issues. Of course, the first problem is the selection of the trend filtering method. This selection may lead to a single procedure or to a pool of methods. The selection of several methods raises the question of an aggregation procedure. This can be done through averaging or dynamic model selection, for instance. The resulting trend indicator is meant to forecast future asset returns at a given horizon.

Intuitively, an investor should buy assets with positive return forecasts and sell assets with negative forecasts. But the size of each long or short position is a quantitative problem that requires a clear investment process. This process should take into account the risk entailed by each position, compared with the expected return. Traditionally, individual risks can be calculated in relation to asset volatility. A correlation matrix can aggregate those individual risks into a global portfolio risk. But in the case of a multi-asset trend following strategy, should we consider the correlation of assets or the correlation of each individual strategy? These may be quite different, as the correlations between strategies are usually smaller than the correlations between assets in absolute terms. Even when the portfolio risks can be calculated, the distribution of those risks between assets or strategies remains an open problem. Clearly, this distribution should take into account the individual risks, their correlations and the expected return of each asset. But there are many competing allocation procedures, such as Markowitz portfolio theory or risk budgeting methods.

In addition, the total amount of risk in the portfolio must be decided. The average target volatility of the portfolio is closely related to the risk aversion of the final investor. But this total amount of risk may not be constant over time, as some periods could bring higher expected returns than others. For example, some funds do not change the average size of their positions during periods of high market volatility. This increases their risks, but they consider that their return opportunities, even when risk-adjusted, are greater during those periods. On the contrary, some investors reduce their exposure to markets during volatility peaks, in order to limit their potential drawdowns. In any case, a consistent investment process should measure and control the global risk of the portfolio.

These are just a few of the questions relating to trend following strategies. Many more arise in practical cases, such as execution policies and transaction cost management. Each of these issues must be studied in depth, and re-examined on a regular basis. This is the essence of quantitative management processes.
A Statistical complements

A.1 State space model and Kalman filtering

A state space model is defined by a transition equation and a measurement equation. In the measurement equation, we postulate the relationship between an observable vector and a state vector, while the transition equation describes the generating process of the state variables. The state vector \alpha_t is generated by a first-order Markov process of the form:

\alpha_t = T_t \alpha_{t-1} + c_t + R_t \eta_t

where \alpha_t is the vector of the m state variables, T_t is an m \times m matrix, c_t is an m \times 1 vector and R_t is an m \times p matrix. The measurement equation of the state-space representation is:

y_t = Z_t \alpha_t + d_t + \epsilon_t

where y_t is an n-dimensional time series, Z_t is an n \times m matrix and d_t is an n \times 1 vector. \eta_t and \epsilon_t are assumed to be white noise processes of dimensions p and n respectively. These two last uncorrelated processes are Gaussian with zero mean and respective covariance matrices Q_t and H_t. \alpha_0 \sim \mathcal{N}(a_0, P_0) describes the initial position of the state vector. We define a_t and a_{t|t-1} as the optimal estimators of \alpha_t based on all the information available respectively at time t and t-1. Let P_t and P_{t|t-1} be the associated covariance matrices^{34}. The Kalman filter consists of the following set of recursive equations (Harvey, 1990):

a_{t|t-1} = T_t a_{t-1} + c_t
P_{t|t-1} = T_t P_{t-1} T_t' + R_t Q_t R_t'
\hat{y}_{t|t-1} = Z_t a_{t|t-1} + d_t
v_t = y_t - \hat{y}_{t|t-1}
F_t = Z_t P_{t|t-1} Z_t' + H_t
a_t = a_{t|t-1} + P_{t|t-1} Z_t' F_t^{-1} v_t
P_t = \left(I_m - P_{t|t-1} Z_t' F_t^{-1} Z_t\right) P_{t|t-1}
where v_t is the innovation process with covariance matrix F_t and \hat{y}_{t|t-1} = \mathbb{E}_{t-1}[y_t]. Harvey (1989) shows that we can obtain a_{t+1|t} directly from a_{t|t-1}:

a_{t+1|t} = \left(T_{t+1} - K_t Z_t\right) a_{t|t-1} + K_t y_t + \left(c_{t+1} - K_t d_t\right)

where K_t = T_{t+1} P_{t|t-1} Z_t' F_t^{-1} is the matrix of gain. We also have:

a_{t+1|t} = T_{t+1} a_{t|t-1} + c_{t+1} + K_t \left(y_t - Z_t a_{t|t-1} - d_t\right)

Finally, we obtain:

y_t = Z_t a_{t|t-1} + d_t + v_t
a_{t+1|t} = T_{t+1} a_{t|t-1} + c_{t+1} + K_t v_t

This system is called the innovation representation.
Let t^* be a fixed given date. We define a_{t|t^*} = \mathbb{E}_{t^*}[\alpha_t] and P_{t|t^*} = \mathbb{E}_{t^*}\left[\left(a_{t|t^*} - \alpha_t\right)\left(a_{t|t^*} - \alpha_t\right)'\right] with t \leq t^*. We have a_{t^*|t^*} = a_{t^*} and P_{t^*|t^*} = P_{t^*}. The Kalman smoother is then defined by the following set of recursive equations:

P_t^* = P_t T_{t+1}' P_{t+1|t}^{-1}
a_{t|t^*} = a_t + P_t^* \left(a_{t+1|t^*} - a_{t+1|t}\right)
P_{t|t^*} = P_t + P_t^* \left(P_{t+1|t^*} - P_{t+1|t}\right) P_t^{*\prime}
34 We have a_t = \mathbb{E}_t[\alpha_t], a_{t|t-1} = \mathbb{E}_{t-1}[\alpha_t], P_t = \mathbb{E}_t\left[(a_t - \alpha_t)(a_t - \alpha_t)'\right] and P_{t|t-1} = \mathbb{E}_{t-1}\left[\left(a_{t|t-1} - \alpha_t\right)\left(a_{t|t-1} - \alpha_t\right)'\right], where \mathbb{E}_t indicates the conditional expectation operator.
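As an illustration of these recursions (this sketch is not part of the original text), the filter can be written directly in Python for a time-invariant model; R_t is taken as the identity matrix and the local linear trend matrices used in the example are placeholders.

```python
import numpy as np

def kalman_filter(y, T, Z, Q, H, c=None, d=None, a0=None, P0=None):
    """Kalman filter for a time-invariant state space model (R_t = identity).
    y: (nobs, n) observations, T: (m, m), Z: (n, m), Q: (m, m), H: (n, n)."""
    nobs, n = y.shape
    m = T.shape[0]
    c = np.zeros(m) if c is None else c
    d = np.zeros(n) if d is None else d
    a = np.zeros(m) if a0 is None else a0
    P = 1e6 * np.eye(m) if P0 is None else P0      # diffuse prior
    filtered = np.zeros((nobs, m))
    for t in range(nobs):
        a_pred = T @ a + c                          # a_{t|t-1}
        P_pred = T @ P @ T.T + Q                    # P_{t|t-1}
        v = y[t] - (Z @ a_pred + d)                 # innovation v_t
        F = Z @ P_pred @ Z.T + H                    # innovation covariance F_t
        K = P_pred @ Z.T @ np.linalg.inv(F)
        a = a_pred + K @ v                          # a_t
        P = (np.eye(m) - K @ Z) @ P_pred            # P_t
        filtered[t] = a
    return filtered

# placeholder local linear trend model: state = (level, slope), y = level + noise
T = np.array([[1.0, 1.0], [0.0, 1.0]])
Z = np.array([[1.0, 0.0]])
Q = np.diag([1e-4, 1e-6])
H = np.array([[1e-2]])
y = np.cumsum(0.01 + 0.1 * np.random.default_rng(0).standard_normal(250)).reshape(-1, 1)
slope = kalman_filter(y, T, Z, Q, H)[:, 1]          # filtered estimate of the trend slope
```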
A.2 L_1 filtering

A.2.1 The dual problem

The L_1 filtering problem can be solved by considering the dual problem, which is a QP programme. We first rewrite the primal problem with a new variable z = D\hat{x}:

\min \frac{1}{2} \left\|y - \hat{x}\right\|_2^2 + \lambda \left\|z\right\|_1 \quad \text{u.c.} \quad z = D\hat{x}

We now construct the Lagrangian function with the dual variable \nu \in \mathbb{R}^{n-2}:

\mathcal{L}(\hat{x}, z, \nu) = \frac{1}{2} \left\|y - \hat{x}\right\|_2^2 + \lambda \left\|z\right\|_1 + \nu' \left(D\hat{x} - z\right)

The dual objective function is obtained in the following way:

\inf_{\hat{x}, z} \mathcal{L}(\hat{x}, z, \nu) = -\frac{1}{2} \nu' D D' \nu + y' D' \nu

for -\lambda \mathbf{1} \leq \nu \leq \lambda \mathbf{1}. According to the Kuhn-Tucker theorem, the initial problem is equivalent to the dual problem:

\min \frac{1}{2} \nu' D D' \nu - y' D' \nu \quad \text{u.c.} \quad -\lambda \mathbf{1} \leq \nu \leq \lambda \mathbf{1}

This QP programme can be solved by a traditional Newton algorithm or by interior-point methods, and finally, the solution of the trend is:

\hat{x} = y - D' \nu
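In practice, the problem can also be handed to a generic convex solver instead of a dedicated interior-point code. The sketch below (an assumption, not the implementation used in the paper) relies on the cvxpy package and solves the primal L_1-T problem directly for a given \lambda, with D the second-difference operator.

```python
import numpy as np
import cvxpy as cp

def l1_trend_filter(y: np.ndarray, lam: float) -> np.ndarray:
    """Solve min 0.5 * ||y - x||_2^2 + lam * ||D x||_1 with D the
    (n-2) x n second-difference operator (L1-T filtering)."""
    n = len(y)
    D = np.diff(np.eye(n), 2, axis=0)          # second-difference matrix
    x = cp.Variable(n)
    objective = cp.Minimize(0.5 * cp.sum_squares(y - x) + lam * cp.norm1(D @ x))
    cp.Problem(objective).solve()
    return x.value

# usage on a toy noisy piecewise-linear signal
rng = np.random.default_rng(0)
t = np.arange(300)
signal = np.where(t < 150, 0.02 * t, 3.0 - 0.01 * (t - 150))
trend = l1_trend_filter(signal + 0.5 * rng.standard_normal(300), lam=50.0)
```

The estimated trend is piecewise linear, with the number of slope changes controlled by \lambda.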
A.2.2 Solving using interior-point algorithms

We briefly present the interior-point algorithm of Boyd and Vandenberghe (2009) in the case of the following optimisation problem:

\min f_0(\nu) \quad \text{u.c.} \quad \begin{cases} A\nu = b \\ f_i(\nu) \leq 0 & \text{for } i = 1, \ldots, m \end{cases}

where f_0, \ldots, f_m : \mathbb{R}^n \to \mathbb{R} are convex and twice continuously differentiable and \operatorname{rank}(A) = p < n. The inequality constraints become implicit if the problem is rewritten as:

\min f_0(\nu) + \sum_{i=1}^{m} I_-\left(f_i(\nu)\right) \quad \text{u.c.} \quad A\nu = b

where I_-(u) : \mathbb{R} \to \mathbb{R} is the non-positive indicator function^{35}. This indicator function is discontinuous, so the Newton method cannot be applied. In order to overcome this problem, we approximate I_-(u) by the logarithmic barrier function \hat{I}_-(u) = -\frac{1}{\tau} \ln(-u) with \tau \to \infty. Finally, the Kuhn-Tucker conditions for this approximated problem give r_\tau(\nu, \lambda, \mu) = 0 with:

r_\tau(\nu, \lambda, \mu) = \begin{pmatrix} \nabla f_0(\nu) + \nabla f(\nu)' \lambda + A' \mu \\ -\operatorname{diag}(\lambda) f(\nu) - \frac{1}{\tau} \mathbf{1} \\ A\nu - b \end{pmatrix}

The solution of r_\tau(\nu, \lambda, \mu) = 0 can be obtained using Newton's iteration for the triple \theta = (\nu, \lambda, \mu):

r_\tau(\theta + \Delta\theta) \approx r_\tau(\theta) + \nabla r_\tau(\theta) \, \Delta\theta = 0

This equation gives the Newton step \Delta\theta = -\nabla r_\tau(\theta)^{-1} r_\tau(\theta), which defines the search direction.

35 We have:

I_-(u) = \begin{cases} 0 & u \leq 0 \\ \infty & u > 0 \end{cases}
A.2.3 The multivariate case

In the multivariate case, the primal problem is:

\min \frac{1}{2} \sum_{j=1}^{m} \left\| y^{(j)} - \hat{x} \right\|_2^2 + \lambda \left\|z\right\|_1 \quad \text{u.c.} \quad z = D\hat{x}

The dual objective function becomes:

\inf_{\hat{x}, z} \mathcal{L}(\hat{x}, z, \nu) = -\frac{1}{2} \nu' D D' \nu + \bar{y}' D' \nu + \frac{1}{2} \sum_{j=1}^{m} \left(y^{(j)} - \bar{y}\right)' \left(y^{(j)} - \bar{y}\right)

for -\lambda \mathbf{1} \leq \nu \leq \lambda \mathbf{1}, where \bar{y} is the average of the series y^{(j)}. According to the Kuhn-Tucker theorem, the initial problem is equivalent to the dual problem:

\min \frac{1}{2} \nu' D D' \nu - \bar{y}' D' \nu \quad \text{u.c.} \quad -\lambda \mathbf{1} \leq \nu \leq \lambda \mathbf{1}

The solution is then \hat{x} = \bar{y} - D' \nu.
A.2.4 The scaling of the smoothing parameter

We can attempt to estimate the order of magnitude of the parameter \lambda_{\max} by considering the continuous case. We assume that the signal is a process W_t. The value of \lambda_{\max} in the discrete case is defined by:

\lambda_{\max} = \left\| \left(D D'\right)^{-1} D y \right\|_\infty

The quantity (D D')^{-1} D y can be considered as the first primitive I_1(T) = \int_0^T W_t \, dt of the process W_t if D = D_1 (L_1-C filtering), or as the second primitive I_2(T) = \int_0^T \int_0^t W_s \, ds \, dt of W_t if D = D_2 (L_1-T filtering). We have:

I_1(T) = \int_0^T W_t \, dt = W_T T - \int_0^T t \, dW_t = \int_0^T (T - t) \, dW_t
The process I_1(T) is a Wiener integral (or a Gaussian process) with variance:

\mathbb{E}\left[I_1^2(T)\right] = \int_0^T (T - t)^2 \, dt = \frac{T^3}{3}

In this case, we expect that \lambda_{\max} \propto T^{3/2}. The second-order primitive can be calculated in the following way:

I_2(T) = \int_0^T I_1(t) \, dt
       = I_1(T) \, T - \int_0^T t \, dI_1(t)
       = I_1(T) \, T - \int_0^T t W_t \, dt
       = I_1(T) \, T - \frac{T^2}{2} W_T + \int_0^T \frac{t^2}{2} \, dW_t
       = \int_0^T \left(\frac{T^2}{2} - Tt + \frac{t^2}{2}\right) dW_t
       = \frac{1}{2} \int_0^T (T - t)^2 \, dW_t

This quantity is again a Gaussian process with variance:

\mathbb{E}\left[I_2^2(T)\right] = \frac{1}{4} \int_0^T (T - t)^4 \, dt = \frac{T^5}{20}

In this case, we expect that \lambda_{\max} \propto T^{5/2}.
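This order of magnitude can be checked numerically. The sketch below (an illustration, with arbitrary simulation settings) computes \lambda_{\max} = \|(DD')^{-1} D y\|_\infty on simulated random walks of increasing length and regresses \log \lambda_{\max} on \log T; the fitted exponent should be close to 3/2 for D = D_1 and 5/2 for D = D_2.

```python
import numpy as np

def lambda_max(y: np.ndarray, order: int) -> float:
    """lambda_max = ||(D D')^{-1} D y||_inf with D the difference operator
    of the given order (1 for L1-C filtering, 2 for L1-T filtering)."""
    D = np.diff(np.eye(len(y)), order, axis=0)
    return np.abs(np.linalg.solve(D @ D.T, D @ y)).max()

rng = np.random.default_rng(0)
sizes = np.array([100, 200, 400, 800, 1600])
for order in (1, 2):
    lam = [np.mean([lambda_max(np.cumsum(rng.standard_normal(T)), order)
                    for _ in range(20)]) for T in sizes]
    slope = np.polyfit(np.log(sizes), np.log(lam), 1)[0]
    print(f"D_{order}: empirical exponent ~ {slope:.2f}")   # ~1.5 and ~2.5 expected
```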
A.3 Wavelet analysis

The time analysis can detect anomalies in time series, such as a market crash on a specific date. The frequency analysis detects repeated sequences in a signal. The double dimension analysis makes it possible to coordinate time and frequency detection, as we use a larger time window for a smaller frequency interval (see Figure 23). In this area, the uncertainty of localisation is 1/dt, with dt the sampling step and f = 1/dt the sampling frequency. The wavelet transform can be a solution for analysing time series in the time-frequency dimension.

The first wavelet approach appeared in the early eighties in seismic data analysis. The term wavelet was introduced in the scientific community by Grossmann and Morlet (1984). Since 1986, a great deal of theoretical research involving wavelets has been developed. The wavelet transform uses a basic function, called the mother wavelet, then dilates and translates it to capture features that are local in time and frequency. The distribution of the time-frequency domain with respect to the wavelet transform is long in time when capturing low-frequency events and long in frequency when capturing high-frequency events. As an example, we represent some mother wavelets in Figure 24.

The aim of wavelet analysis is to separate signal trends and details. These different components can be distinguished by different levels of resolution or different sizes/scales of detail. In this sense, it generates a phase space decomposition which is defined by two
Figure 23: Time-frequency dimension
Figure 24: Some mother wavelets
parameters (scale and location) in opposition to a Fourier decomposition. A wavelet \psi(t) is a function of time t such that:

\int_{-\infty}^{+\infty} \psi(t) \, dt = 0
\int_{-\infty}^{+\infty} |\psi(t)|^2 \, dt = 1

The continuous wavelet transform is a function of two variables W(u, s) and is given by projecting the time series x(t) onto a particular wavelet by:

W(u, s) = \int_{-\infty}^{+\infty} x(t) \, \psi_{u,s}(t) \, dt

with:

\psi_{u,s}(t) = \frac{1}{\sqrt{s}} \, \psi\!\left(\frac{t - u}{s}\right)

which corresponds to the mother wavelet translated by u (location parameter) and dilated by s (scale parameter). If the wavelet satisfies the previous properties, the inverse operation may be performed to produce the original signal from its wavelet coefficients:

x(t) = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} W(u, s) \, \psi_{u,s}(t) \, du \, ds
The continuous wavelet transform of a time series signal x(t) gives an infinite number of coefficients W(u, s) where u \in \mathbb{R} and s \in \mathbb{R}^+, but many coefficients are close or equal to zero. The discrete wavelet transform can be used to decompose a signal into a finite number of coefficients, where we use s = 2^{-j} as the scale parameter and u = k 2^{-j} as the location parameter with j \in \mathbb{Z} and k \in \mathbb{Z}. Therefore \psi_{u,s}(t) becomes:

\psi_{j,k}(t) = 2^{j/2} \, \psi\!\left(2^j t - k\right)

where j = 1, 2, \ldots, J in a J-level decomposition. The wavelet representation of a discrete signal x(t) is given by:

x(t) = s_{(0)} \, \phi(t) + \sum_{j=0}^{J-1} \sum_{k=0}^{2^j - 1} d_{(j),k} \, \psi_{j,k}(t)

where \phi(t) = 1 if t \in [0, 1] and J is the number of multi-resolution levels. Therefore, computing the wavelet transform of the discrete signal is equivalent to computing the smooth coefficient s_{(0)} and the detail coefficients d_{(j),k}.
Introduced by Mallat (1989), the multi-scale analysis corresponds to the following iterative scheme:

x \rightarrow [s \mid d] \rightarrow [ss \mid sd] \rightarrow [sss \mid ssd] \rightarrow [ssss \mid sssd]

where the high-pass filter defines the details of the data and the low-pass filter defines the smoothing signal. In this example, we obtain these wavelet coefficients:

W = \begin{pmatrix} ssss \\ sssd \\ ssd \\ sd \\ d \end{pmatrix}

Applying this pyramidal algorithm to the time series signal up to the J resolution level gives us the wavelet coefficients:

W = \begin{pmatrix} s_{(0)} \\ d_{(0)} \\ d_{(1)} \\ \vdots \\ d_{(J-1)} \end{pmatrix}
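This pyramidal algorithm is available in standard wavelet libraries. Assuming the PyWavelets package is installed, a trend can be extracted by computing the multi-level decomposition and reconstructing the signal from the smooth coefficients only; the wavelet name, the number of levels and the toy series below are illustrative choices.

```python
import numpy as np
import pywt

def wavelet_trend(y: np.ndarray, wavelet: str = "db4", level: int = 4,
                  keep_details: int = 0) -> np.ndarray:
    """Multi-level DWT of y; reconstruct using the smooth coefficients and
    only the 'keep_details' coarsest detail levels (trend extraction)."""
    coeffs = pywt.wavedec(y, wavelet, level=level)   # [s(0), d(0), ..., d(J-1)]
    for j in range(1 + keep_details, len(coeffs)):
        coeffs[j] = np.zeros_like(coeffs[j])         # discard the finer details
    return pywt.waverec(coeffs, wavelet)[: len(y)]

# usage on a noisy trending series
rng = np.random.default_rng(0)
y = np.cumsum(rng.standard_normal(512)) + 0.05 * np.arange(512)
trend = wavelet_trend(y, level=5)
```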
A.4 Support vector machine

The support vector machine (SVM) is an important part of statistical learning theory (Hastie et al., 2009). It was first introduced by Boser et al. (1992) and has been used in various domains such as pattern recognition, biometrics, etc. This technique can be employed in different contexts such as classification, regression or density estimation (see Vapnik, 1998). Recently, applications in finance have been developed in two main directions. The first employs the SVM as a nonlinear estimator in order to forecast the trend or volatility of financial assets. In this context, the SVM is used as a regression technique with the possibility of extension to nonlinear cases thanks to the kernel approach. The second direction consists of using the SVM as a classification technique which aims to define the stock selection in trading strategies.
A.4.1 SVM in a nutshell

We illustrate here the basic idea of the SVM as a classification method. Let us define the training data set consisting of n pairs of input/output points (x_i, y_i) where x_i \in \mathcal{X} and y_i \in \{-1, +1\}. The idea of linear classification is to look for a possible hyperplane that can separate the points x_i \in \mathcal{X} into two classes corresponding to the labels y_i = \pm 1. It consists of constructing a linear discriminant function h(x) = w'x + b where w is the vector of weights and b is called the bias. The hyperplane is then defined by the following equation:

\mathcal{H} = \{x : h(x) = w'x + b = 0\}

The vector w is interpreted as the normal vector to the hyperplane. We denote its norm \|w\| and its direction \bar{w} = w / \|w\|. In Figure 25, we give a geometric interpretation of the margin in the linear case. Let x_+ and x_- be the closest points to the hyperplane from the positive side and the negative side. These points determine the margin to the boundary from which the two classes of points \mathcal{D} are separated:

m_{\mathcal{D}}(h) = \frac{1}{2} \bar{w}' \left(x_+ - x_-\right) = \frac{1}{\|w\|}
Figure 25: Geometric interpretation of the margin in a linear SVM
The main idea of a maximum margin classifier is to determine the hyperplane that maximises the margin. For a separable dataset, the margin SVM is defined by the following optimisation problem:

\min_{w, b} \frac{1}{2} \|w\|^2 \quad \text{u.c.} \quad y_i \left(w' x_i + b\right) \geq 1 \quad \text{for } i = 1, \ldots, n

The historical approach to solving this quadratic problem with nonlinear constraints is to map the primal problem to the dual problem:

\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, x_i' x_j \quad \text{u.c.} \quad \alpha_i \geq 0 \quad \text{for } i = 1, \ldots, n

Because of the Kuhn-Tucker conditions, the optimised solution (w^*, b^*) of the primal problem is given by w^* = \sum_{i=1}^{n} \alpha_i^* y_i x_i where \alpha^* = (\alpha_1^*, \ldots, \alpha_n^*) is the solution of the dual problem.
We notice that the linear SVM depends on the input data via the inner product. An intelligent way to extend the SVM formalism to the nonlinear case is then to replace the inner product with a nonlinear kernel. Hence, the nonlinear SVM dual problem can be obtained by systematically replacing the inner product x_i' x_j by a general kernel K(x_i, x_j). Some standard kernels are widely used in pattern recognition, for example polynomial, radial basis or neural network kernels^{36}. Finally, the decision/prediction function is then given by:

f(x) = \operatorname{sgn} h(x) = \operatorname{sgn}\left(\sum_{i=1}^{n} \alpha_i^* y_i K(x, x_i) + b^*\right)
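For illustration only (this example is not taken from the paper), such a maximum-margin classifier can be trained with the scikit-learn library; the features below, trailing cumulative returns over hypothetical horizons, and the label, the sign of the next one-month return, are arbitrary modelling choices.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
returns = 0.0002 + 0.01 * rng.standard_normal(2000)          # toy daily returns

# hypothetical features: trailing 1m / 3m / 6m cumulative returns
horizons = (21, 63, 126)
X, y = [], []
for t in range(max(horizons), len(returns) - 21):
    X.append([returns[t - h:t].sum() for h in horizons])
    y.append(1 if returns[t:t + 21].sum() > 0 else -1)        # sign of the next month
X, y = np.array(X), np.array(y)

# Gaussian-kernel soft-margin SVM; C and gamma are illustrative values
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X[:-250], y[:-250])                                   # train on all but the last year
print("out-of-sample hit ratio:", clf.score(X[-250:], y[-250:]))
```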
A.4.2 SVM regression

In the last discussion, we presented the basic idea of the SVM in the classification context. We now show how the regression problem can be interpreted as an SVM problem. In the general framework of statistical learning, the SVM problem consists of minimising the risk function \mathcal{R}(f) depending on the form of the prediction function f(x). The risk function is calculated via the loss function L(f(x), y), which clearly defines our objective (classification or regression):

\mathcal{R}(f) = \int L(f(x), y) \, dP(x, y)

where the distribution P(x, y) can be computed by the empirical distribution^{37} or an approximated distribution^{38}. For the regression problem, the loss function is simply defined as L(f(x), y) = (f(x) - y)^2 or L(f(x), y) = |f(x) - y|^p in the case of the L_p norm.

We have seen that the linear SVM is a special case of the nonlinear SVM within the kernel approach. We therefore consider the nonlinear case directly, where the approximate function of the regression has the following form f(x) = w' \phi(x) + b. In the VRM framework, we assume that P(x, y) is a Gaussian noise with variance \sigma^2:

\mathcal{R}(f) = \frac{1}{n} \sum_{i=1}^{n} |f(x_i) - y_i|^p + \sigma^2 \|w\|^2
We introduce the variable \xi = (\xi_1, \ldots, \xi_n) which satisfies y_i = f(x_i) + \xi_i. The optimisation problem of the risk function can now be written as a QP programme with nonlinear constraints:

\min_{w, b, \xi} \frac{1}{2} \|w\|^2 + \frac{1}{2 n \sigma^2} \sum_{i=1}^{n} |\xi_i|^p \quad \text{u.c.} \quad y_i = w' \phi(x_i) + b + \xi_i \quad \text{for } i = 1, \ldots, n
In the present form, the regression looks very similar to the SVM classification problem and can be solved in the same way by mapping to the dual problem. We notice that the SVM regression can easily be generalised in two possible ways:

1. by introducing a more general loss function such as the \epsilon-SV regression proposed by Vapnik (1998);

2. by using a weighting distribution for the empirical distribution:

dP(x, y) = \sum_{i=1}^{n} p_i \, \delta_{x_i}(x) \, \delta_{y_i}(y)

As financial series have short memory and depend more on the recent past, an asymmetric weight distribution focusing on recent data would improve the prediction^{39}.

36 We have, respectively, K(x_i, x_j) = \left(x_i' x_j + 1\right)^p, K(x_i, x_j) = \exp\left(-\|x_i - x_j\|^2 / (2\sigma^2)\right) or K(x_i, x_j) = \tanh\left(a \, x_i' x_j - b\right).

37 This framework, called ERM, was first introduced by Vapnik and Chervonenkis (1991).

38 This framework is called VRM (Chapelle, 2002).
The dual problem in the case p = 1 is given by:

\max_{\alpha} \; \alpha' y - \frac{1}{2} \alpha' K \alpha \quad \text{u.c.} \quad \begin{cases} \alpha' \mathbf{1} = 0 \\ |\alpha| \leq \left(2 n \sigma^2\right)^{-1} \mathbf{1} \end{cases}

As previously, the optimal vector \alpha^* is obtained by solving the QP programme. We then deduce that w^* = \sum_{i=1}^{n} \alpha_i^* \phi(x_i) and b^* is computed using the Kuhn-Tucker condition:

w^{*\prime} \phi(x_i) + b^* - y_i = 0

for the support vectors (x_i, y_i). In order to achieve a good level of accuracy for the estimation of b, we average over the set of support vectors and obtain b^*. The SVM regressor is then given by the following formula:

f(x) = \sum_{i=1}^{n} \alpha_i^* K(x, x_i) + b^*

with K(x, x_i) = \phi(x)' \phi(x_i).
In Figure 26, we apply SVM regression with the Gaussian kernel to the S&P 500 index. The kernel parameter \sigma characterises the estimation horizon, which is equivalent to the period n in the moving average regression.
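A comparable filter can be obtained with an off-the-shelf implementation. The sketch below (an assumption, not the code used for Figure 26) uses the SVR estimator of scikit-learn with a Gaussian kernel, regressing the series on its time index; the bandwidth gamma plays the role of the kernel parameter \sigma, a small gamma corresponding to a long estimation horizon. All parameter values are placeholders.

```python
import numpy as np
from sklearn.svm import SVR

def svm_filter(y: np.ndarray, gamma: float = 1e-4, C: float = 100.0,
               epsilon: float = 0.1) -> np.ndarray:
    """Gaussian-kernel SVM regression of the series on its time index.
    Small gamma <-> large sigma <-> long estimation horizon."""
    t = np.arange(len(y), dtype=float).reshape(-1, 1)
    model = SVR(kernel="rbf", gamma=gamma, C=C, epsilon=epsilon)
    model.fit(t, y)
    return model.predict(t)

# usage on a noisy trending series
rng = np.random.default_rng(0)
y = np.cumsum(0.05 + rng.standard_normal(500))
smooth = svm_filter(y, gamma=1e-4)
```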
A.5 Singular spectrum analysis

In recent years the singular spectrum analysis (SSA) technique has been developed as a time-frequency domain method^{40}. It consists of decomposing a time series into a trend, oscillatory components and a noise.

The method is based on the principal component analysis of the auto-covariance matrix of the time series y = (y_1, \ldots, y_t). Let n be the window length such that n = t - m + 1 with m < t/2. We define the n \times m Hankel matrix \mathcal{H} as the matrix of the m concatenated lag vectors of y:
\mathcal{H} = \begin{pmatrix}
y_1 & y_2 & y_3 & \cdots & y_m \\
y_2 & y_3 & y_4 & \cdots & y_{m+1} \\
y_3 & y_4 & y_5 & \cdots & \vdots \\
\vdots & \vdots & \vdots & \ddots & y_{t-1} \\
y_n & y_{n+1} & y_{n+2} & \cdots & y_t
\end{pmatrix}
We recover the time series y by diagonal averaging:

y_p = \frac{1}{\bar{n}_p} \sum_{j=1}^{m} \mathcal{H}_{(i,j)} \qquad (10)
39 See Gestel et al. (2001) and Tay and Cao (2002).

40 Introduced by Broomhead and King (1986).
Figure 26: SVM filtering
where i = p - j + 1, 0 < i < n + 1 and:

\bar{n}_p = \begin{cases} p & \text{if } p < m \\ t - p + 1 & \text{if } p > t - m + 1 \\ m & \text{otherwise} \end{cases}

This relationship seems trivial because each \mathcal{H}_{(i,j)} is equal to y_p with respect to the conditions on i and j. But this equality no longer holds if we apply factor analysis. Let \mathcal{C} = \mathcal{H}' \mathcal{H} be the covariance matrix of \mathcal{H}. By performing the eigenvalue decomposition \mathcal{C} = V \Lambda V', we can deduce the corresponding principal components:

T_k = \mathcal{H} V_k

where V_k is the matrix of the first k eigenvectors of \mathcal{C}.
Let us now define the n \times m matrix \tilde{\mathcal{H}} as follows:

\tilde{\mathcal{H}} = T_k V_k'

We have \tilde{\mathcal{H}} = \mathcal{H} if all the components are selected. If k < m, we have removed the noise and the trend \hat{x} is estimated by applying the diagonal averaging procedure (10) to the matrix \tilde{\mathcal{H}}.

We have applied the singular spectrum decomposition to the S&P 500 index with different lags m. For each lag, we compute the Hankel matrix \mathcal{H}, then deduce the matrix \tilde{\mathcal{H}} using only the first eigenvector (k = 1) and estimate the corresponding trend. Results are given in Figure 27. As for other methods, such as nonlinear filters, the calibration depends on the parameter m, which controls the window length.
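The procedure described above, Hankel embedding, eigendecomposition, truncation to the first k components and diagonal averaging, can be written compactly as follows (an illustrative sketch; the window length m and the number of components k are the calibration parameters discussed in the text):

```python
import numpy as np

def ssa_trend(y: np.ndarray, m: int, k: int = 1) -> np.ndarray:
    """Basic SSA: build the Hankel matrix, keep the first k principal
    components and recover the trend by diagonal averaging."""
    t = len(y)
    n = t - m + 1
    H = np.column_stack([y[j:j + n] for j in range(m)])   # n x m Hankel matrix
    eigval, V = np.linalg.eigh(H.T @ H)                   # eigendecomposition of H'H
    Vk = V[:, np.argsort(eigval)[::-1][:k]]               # first k eigenvectors
    Hk = H @ Vk @ Vk.T                                    # rank-k approximation
    # diagonal averaging: average Hk over the anti-diagonals i + j = constant
    x, counts = np.zeros(t), np.zeros(t)
    for j in range(m):
        x[j:j + n] += Hk[:, j]
        counts[j:j + n] += 1.0
    return x / counts

# usage: estimate the trend of a noisy series with a 60-observation window
rng = np.random.default_rng(0)
y = np.cumsum(0.03 + rng.standard_normal(400))
trend = ssa_trend(y, m=60, k=1)
```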
Figure 27: SSA filtering
References
[1] Alexandrov T., Bianconcini S., Dagum E.B., Maass P. and McElroy T. (2008),
A Review of Some Modern Approaches to the Problem of Trend Extraction , US Census
Bureau, RRS #2008/03.
[2] Antoniadis A., Gregoire G. and McKeague I.W. (1994), Wavelet Methods for
Curve Estimation, Journal of the American Statistical Association, 89(428), pp. 1340-
1353.
[3] Barberis N. and Thaler T. (2002), A Survey of Behavioral Finance, NBER Working
Paper, 9222.
[4] Beveridge S. and Nelson C.R. (1981), A New Approach to the Decomposition of
Economic Time Series into Permanent and Transitory Components with Particular
Attention to Measurement of the Business Cycle, Journal of Monetary Economics,
7(2), pp. 151-174.
[5] Boser B.E., Guyon I.M. and Vapnik V. (1992), A Training Algorithm for Optimal
Margin Classier, Proceedings of the Fifth Annual Workshop on Computational Learn-
ing Theory, pp. 114-152.
[6] Boyd S. and Vandenberghe L. (2009), Convex Optimization, Cambridge University
Press.
[7] Brockwell P.J. and Davis R.A. (2003), Introduction to Time Series and Forecasting,
Springer.
[8] Broomhead D.S. and King G.P. (1986), On the Qualitative Analysis of Experimental
Dynamical Systems, in Sarkar S. (ed.), Nonlinear Phenomena and Chaos, Adam Hilger,
pp. 113-144.
[9] Brown S.J., Goetzmann W.N. and Kumar A. (1998), The Dow Theory: William
Peter Hamilton's Track Record Reconsidered, Journal of Finance, 53(4), pp. 1311-1333.
[10] Burch N., Fishback P.E. and Gordon R. (2005), The Least-Squares Property of the
Lanczos Derivative, Mathematics Magazine, 78(5), pp. 368-378.
[11] Carhart M.M. (1997), On Persistence in Mutual Fund Performance, Journal of Fi-
nance, 52(1), pp. 57-82.
[12] Chan L.K.C., Jegadeesh N. and Lakonishok J. (1996), Momentum Strategies, Jour-
nal of Finance, 51(5), pp. 1681-1713.
[13] Chang Y., Miller J.I. and Park J.Y. (2009), Extracting a Common Stochastic Trend:
Theory with Some Applications, Journal of Econometrics, 150(2), pp. 231-247.
[14] Chapelle O. (2002), Support Vector Machine: Induction Principles, Adaptive Tuning
and Prior Knowledge, PhD thesis, University of Paris 6.
[15] Cleveland W.P. and Tiao G.C. (1976), Decomposition of Seasonal Time Series: A
Model for the Census X-11 Program, Journal of the American Statistical Association,
71(355), pp. 581-587.
[16] Cleveland W.S. (1979), Robust Locally Weighted Regression and Smoothing Scatterplots, Jour-
nal of the American Statistical Association, 74(368), pp. 829-836.
[17] Cleveland W.S. and Devlin S.J. (1988), Locally Weighted Regression: An Approach
to Regression Analysis by Local Fitting, Journal of the American Statistical Associa-
tion, 83(403), pp. 596-610.
[18] Cochrane J. (2001), Asset Pricing, Princeton University Press.
[19] Cortes C. and Vapnik V. (1995), Support-Vector Networks, Machine Learning, 20(3),
pp. 273-297.
[20] D'Aspremont A. (2011), Identifying Small Mean Reverting Portfolios, Quantitative
Finance, 11(3), pp. 351-364.
[21] Daubechies I. (1992), Ten Lectures on Wavelets, SIAM.
[22] Daubechies I., Defrise M. and De Mol C. (2004), An Iterative Thresholding Al-
gorithm for Linear Inverse Problems with a Sparsity Constraint, Communications on
Pure and Applied Mathematics, 57(11), pp. 1413-1457.
[23] Donoho D.L. (1995), De-Noising by Soft-Thresholding, IEEE Transactions on Infor-
mation Theory, 41(3), pp. 613-627.
[24] Donoho D.L. and Johnstone I.M. (1994), Ideal Spatial Adaptation via Wavelet
Shrinkage, Biometrika, 81(3), pp. 425-455.
[25] Donoho D.L. and Johnstone I.M. (1995), Adapting to Unknown Smoothness via
Wavelet Shrinkage, Journal of the American Statistical Association, 90(432), pp. 1200-
1224.
[26] Doucet A., De Freitas N. and Gordon N. (2001), Sequential Monte Carlo in Prac-
tice, Springer.
[27] Ehlers J.F. (2001), Rocket Science for Traders: Digital Signal Processing Applications,
John Wiley & Sons.
[28] Elton E.J. and Gruber M.J. (1972), Earnings Estimates and the Accuracy of Expec-
tational Data, Management Science, 18(8), pp. 409-424.
[29] Engle R.F. and Granger C.W.J. (1987), Co-Integration and Error Correction: Rep-
resentation, Estimation, and Testing, Econometrica, 55(2), pp. 251-276.
[30] Fama E. (1970), Efficient Capital Markets: A Review of Theory and Empirical Work,
Journal of Finance, 25(2), pp. 383-417.
[31] Flandrin P., Rilling G. and Gonçalves P. (2004), Empirical Mode Decomposition
as a Filter Bank, Signal Processing Letters, 11(2), pp. 112-114.
[32] Fliess M. and Join C. (2009), A Mathematical Proof of the Existence of Trends in
Financial Time Series, in El Jai A., A L. and Zerrik E. (eds), Systems Theory:
Modeling, Analysis and Control, Presses Universitaires de Perpignan, pp. 43-62.
[33] Fuentes M. (2002), Spectral Methods for Nonstationary Spatial Processes, Biometrika,
89(1), pp. 197-210.
[34] Gençay R., Selçuk F. and Whitcher B. (2002), An Introduction to Wavelets and
Other Filtering Methods in Finance and Economics, Academic Press.
[35] Gestel T.V., Suykens J.A.K., Baestaens D., Lambrechts A., Lanckriet G.,
Vandaele B., De Moor B. and Vandewalle J. (2001), Financial Time Series Pre-
diction Using Least Squares Support Vector Machines Within the Evidence Framework,
IEEE Transactions on Neural Networks, 12(4), pp. 809-821.
[36] Golyandina N., Nekrutkin V.V. and Zhigljavsky A.A. (2001), Analysis of Time
Series Structure: SSA and Related Techniques, Chapman & Hall, CRC.
[37] Gonzalo J. and Granger C.W.J. (1995), Estimation of Common Long-Memory Com-
ponents in Cointegrated Systems, Journal of Business & Economic Statistics, 13(1), pp.
27-35.
[38] Grinblatt M., Titman S. and Wermers R. (1995), Momentum Investment Strate-
gies, Portfolio Performance, and Herding: A Study of Mutual Fund Behavior, American
Economic Review, 85(5), pp. 1088-1105.
[39] Groetsch C.W. (1998), Lanczos Generalized Derivative, American Mathematical
Monthly, 105(4), pp. 320-326.
[40] Grossmann A. and Morlet J. (1984), Decomposition of Hardy Functions into Square
Integrable Wavelets of Constant Shape, SIAM Journal of Mathematical Analysis, 15,
pp. 723-736.
[41] Härdle W. (1992), Applied Nonparametric Regression, Cambridge University Press.
[42] Harvey A.C. (1989), Forecasting, Structural Time Series Models and the Kalman Fil-
ter, Cambridge University Press.
[43] Harvey A.C. and Trimbur T.M. (2003), General Model-Based Filters for Extracting
Cycles and Trends in Economic Time Series, Review of Economics and Statistics, 85(2),
pp. 244-255.
[44] Hastie T., Tibshirani R. and Friedman R. (2009), The Elements of Statistical Learn-
ing, second edition, Springer.
[45] Henderson R. (1916), Note on Graduation by Adjusted Average, Transactions of the
Actuarial Society of America, 17, pp. 43-48.
[46] Hodrick R.J. and Prescott E.C. (1997), Postwar U.S. Business Cycles: An Empirical
Investigation, Journal of Money, Credit and Banking, 29(1), pp. 1-16.
[47] Holt C.C. (1959), Forecasting Seasonals and Trends by Exponentially Weighted Mov-
ing Averages, ONR Research Memorandum, 52, reprinted in International Journal of
Forecasting, 2004, 20(1), pp. 5-10.
[48] Hong H. and Stein J.C. (1997), A Unified Theory of Underreaction, Momentum Trad-
ing and Overreaction in Asset Markets, NBER Working Paper, 6324.
[49] Johansen S. (1988), Statistical Analysis of Cointegration Vectors, Journal of Economic
Dynamics and Control, 12(2-3), pp. 231-254.
[50] Johansen S. (1991), Estimation and Hypothesis Testing of Cointegration Vectors in
Gaussian Vector Autoregressive Models, Econometrica, 52(6), pp. 1551-1580.
[51] Kalaba R. and Tesfatsion L. (1989), Time-varying Linear Regression via Flexible
Least Squares, Computers & Mathematics with Applications, 17, pp. 1215-1245.
[52] Kalman R.E. (1960), A New Approach to Linear Filtering and Prediction Problems,
Transactions of the ASME Journal of Basic Engineering, 82(D), pp. 35-45.
[53] Kendall M.G. (1973), Time Series, Charles Grin.
[54] Kim S-J., Koh K., Boyd S. and Gorinevsky D. (2009), ℓ_1 Trend Filtering, SIAM Review, 51(2), pp. 339-360.
[55] Kolmogorov A.N. (1941), Interpolation and Extrapolation of Random Sequences,
Izvestiya Akademii Nauk SSSR, Seriya Matematicheskaya, 5(1), pp. 3-14.
[56] Macaulay F. (1931), The Smoothing of Time Series, National Bureau of Economic
Research.
[57] Mallat S.G. (1989), A Theory for Multiresolution Signal Decomposition: The Wavelet
Representation, IEEE Transactions on Pattern Analysis and Machine Intelligence,
11(7), pp. 674-693.
[58] Mann H.B. (1945), Nonparametric Tests against Trend, Econometrica, 13(3), pp. 245-
259.
[59] Martin W. and Flandrin P. (1985), Wigner-Ville Spectral Analysis of Nonstationary
Processes, IEEE Transactions on Acoustics, Speech and Signal Processing, 33(6), pp.
1461-1470.
[60] Muth J.F. (1960), Optimal Properties of Exponentially Weighted Forecasts, Journal
of the American Statistical Association, 55(290), pp. 299-306.
[61] Oppenheim A.V. and Schafer R.W. (2009), Discrete-Time Signal Processing, third
edition, Prentice-Hall.
[62] Peña D. and Box G.E.P. (1987), Identifying a Simplifying Structure in Time Series,
Journal of the American Statistical Association, 82(399), pp. 836-843.
[63] Pollock D.S.G. (2006), Wiener-Kolmogorov Filtering, Frequency-Selective Filtering and Polynomial Regression, Econometric Theory, 23, pp. 71-83.
[64] Pollock D.S.G. (2009), Statistical Signal Extraction: A Partial Survey, in Kon-
toghiorges E. and Belsley D.E. (eds.), Handbook of Empirical Econometrics, John Wiley
and Sons.
[65] Rao S.T. and Zurbenko I.G. (1994), Detecting and Tracking Changes in Ozone air
Quality, Journal of Air and Waste Management Association, 44(9), pp. 1089-1092.
[66] Roncalli T. (2010), La Gestion dActifs Quantitative, Economica.
[67] Savitzky A. and Golay M.J.E. (1964), Smoothing and Dierentiation of Data by
Simplied Least Squares Procedures, Analytical Chemistry, 36(8), pp. 1627-1639.
[68] Silverman B.W. (1985), Some Aspects of the Spline Smoothing Approach to Non-
Parametric Regression Curve Fitting, Journal of the Royal Statistical Society, B47(1),
pp. 1-52.
[69] Sorenson H.W. (1970), Least-Squares Estimation: From Gauss to Kalman, IEEE
Spectrum, 7, pp. 63-68.
[70] Stock J.H. and Watson M.W. (1988), Variable Trends in Economic Time Series,
Journal of Economic Perspectives, 2(3), pp. 147-174.
[71] Tay F.E.H. and Cao L.J. (2002), Modified Support Vector Machines in Financial Time Series Forecasting, Neurocomputing, 48(1-4), pp. 847-861.
[72] Tibshirani R. (1996), Regression Shrinkage and Selection via the Lasso, Journal of
the Royal Statistical Society, B58(1), pp. 267-288.
[73] Vapnik V. (1998), Statistical Learning Theory, John Wiley and Sons, New York.
[74] Vapnik V. and Chervonenkis A. (1991), On the Uniform Convergence of Relative Frequencies of Events to their Probabilities, Theory of Probability and its Applications, 16(2), pp. 264-280.
[75] Vautard R., Yiou P., and Ghil M. (1992), Singular Spectrum Analysis: A Toolkit
for Short, Noisy Chaotic Signals, Physica D, 58(1-4), pp. 95-126.
[76] Wahba G. (1990), Spline Models for Observational Data, CBMS-NSF Regional Con-
ference Series in Applied Mathematics, 59, SIAM.
[77] Wang Y. (1998), Change Curve Estimation via Wavelets, Journal of the American
Statistical Association, 93(441), pp. 163-172.
[78] Wiener N. (1949), Extrapolation, Interpolation and Smoothing of Stationary Time
Series with Engineering Applications, MIT Technology Press and John Wiley & Sons
(originally published in 1941 as a Report on the Services Research Project, DIC-6037).
[79] Whittaker E.T. (1923), On a New Method of Graduation, Proceedings of the Edin-
burgh Mathematical Society, 41, pp. 63-75.
[80] Winters P.R. (1960), Forecasting Sales by Exponentially Weighted Moving Averages,
Management Science, 6(3), 324-342.
[81] Yue S. and Pilon P. (2004), A Comparison of the Power of the t-test, Mann-Kendall
and Bootstrap Tests for Trend Detection, Hydrological Sciences Journal, 49(1), 21-37.
[82] Zurbenko I., Porter P.S., Rao S.T., Ku J.K., Gui R. and Eskridge R.E. (1996),
Detecting Discontinuities in Time Series of Upper-Air Data: Demonstration of an Adap-
tive Filter Technique, Journal of Climate, 9(12), pp. 3548-3560.
Lyxor White Paper Series
List of Issues

Issue #1 Risk-Based Indexation. Paul Demey, Sébastien Maillard and Thierry Roncalli, March 2010.

Issue #2 Beyond Liability-Driven Investment: New Perspectives on Defined Benefit Pension Fund Management. Benjamin Bruder, Guillaume Jamet and Guillaume Lasserre, March 2010.

Issue #3 Mutual Fund Ratings and Performance Persistence. Pierre Hereil, Philippe Mitaine, Nicolas Moussavi and Thierry Roncalli, June 2010.

Issue #4 Time Varying Risk Premiums & Business Cycles: A Survey. Serge Darolles, Karl Eychenne and Stéphane Martinetti, September 2010.

Issue #5 Portfolio Allocation of Hedge Funds. Benjamin Bruder, Serge Darolles, Abdul Koudiraty and Thierry Roncalli, January 2011.

Issue #6 Strategic Asset Allocation. Karl Eychenne, Stéphane Martinetti and Thierry Roncalli, March 2011.

Issue #7 Risk-Return Analysis of Dynamic Investment Strategies. Benjamin Bruder and Nicolas Gaussel, June 2011.
Disclaimer

Each of this material and its content is confidential and may not be reproduced or provided to others without the express written permission of Lyxor Asset Management (Lyxor AM). This material has been prepared solely for informational purposes only and it is not intended to be and should not be considered as an offer, or a solicitation of an offer, or an invitation or a personal recommendation to buy or sell participating shares in any Lyxor Fund, or any security or financial instrument, or to participate in any investment strategy, directly or indirectly.

It is intended for use only by those recipients to whom it is made directly available by Lyxor AM. Lyxor AM will not treat recipients of this material as its clients by virtue of their receiving this material.

This material reflects the views and opinions of the individual authors at this date and in no way the official position or advices of any kind of these authors or of Lyxor AM and thus does not engage the responsibility of Lyxor AM nor of any of its officers or employees. All performance information set forth herein is based on historical data and, in some cases, hypothetical data, and may reflect certain assumptions with respect to fees, expenses, taxes, capital charges, allocations and other factors that affect the computation of the returns. Past performance is not necessarily a guide to future performance. While the information (including any historical or hypothetical returns) in this material has been obtained from external sources deemed reliable, neither Société Générale (SG), Lyxor AM, nor their affiliates, officers or employees guarantee its accuracy, timeliness or completeness. Any opinions expressed herein are statements of our judgment on this date and are subject to change without notice. SG, Lyxor AM and their affiliates assume no fiduciary responsibility or liability for any consequences, financial or otherwise, arising from an investment in any security or financial instrument described herein or in any other security, or from the implementation of any investment strategy.

Lyxor AM and its affiliates may from time to time deal in, profit from the trading of, hold, have positions in, or act as market makers, advisers, brokers or otherwise in relation to the securities and financial instruments described herein.

Service marks appearing herein are the exclusive property of SG and its affiliates, as the case may be.

This material is communicated by Lyxor Asset Management, which is authorized and regulated in France by the Autorité des Marchés Financiers (French Financial Markets Authority).

© 2011 LYXOR ASSET MANAGEMENT ALL RIGHTS RESERVED
Lyxor Asset Management
Tour Société Générale - 17 cours Valmy
92987 Paris La Défense Cedex - France
research@lyxor.com - www.lyxor.com
The Lyxor White Paper Series is a quarterly publication providing our clients access to intellectual capital, risk analytics and quantitative research developed within Lyxor Asset Management. The Series covers in-depth studies of investment strategies, asset allocation methodologies and risk management techniques. We hope you will find the Lyxor White Paper Series stimulating and interesting.

PUBLISHING DIRECTORS
Alain Dubois, Chairman of the Board
Laurent Seyer, Chief Executive Officer

EDITORIAL BOARD
Nicolas Gaussel, PhD, Managing Editor
Thierry Roncalli, PhD, Associate Editor
Benjamin Bruder, PhD, Associate Editor