[Extended Abstract]
Martin Gavrilov Dragomir Anguelov Piotr Indyk Rajeev Motwani
Department of Computer Science
Stanford University
Stanford, CA 94306-9010
{martinga, drago, indyk, rajeev}@cs.stanford.edu
ABSTRACT
In recent years, there has been a lot of interest in the database community in mining time series data. Surprisingly, little work has been done on verifying which measures are most suitable for mining a given class of data sets. Such work is of crucial importance, since it enables us to identify similarity measures which are useful in a given context, and for which efficient algorithms should therefore be further investigated. Moreover, an accurate evaluation of the performance of even existing algorithms is not possible without a good understanding of the data sets occurring in practice.

In this work we attempt to fill this gap by studying similarity measures for clustering of similar stocks (which, of course, is an interesting problem in its own right). Our approach is to cluster the stocks according to various measures (including several novel ones) and compare the results to the "ground-truth" clustering based on the Standard and Poor 500 Index. Our experiments reveal several interesting facts about the similarity measures used for stock-market data.

Categories and Subject Descriptors
I.5.3 [Clustering]: Similarity measures

Keywords
time series, stock, clustering, similarity measures, data mining
1. INTRODUCTION
In recent years, there has been a lot of interest in the database community in mining time series data. Such data naturally arise in business as well as scientific decision-support applications; examples include stock market data (probably the most studied time series in history), production capacities, population statistics, and sales amounts. Since the data sets occurring in practice tend to be very large, most of the work has focused on the design of efficient algorithms for various mining problems and, most notably, the search for similar (sub)sequences with respect to a variety of measures [1, 2, 3, 4, 5, 7, 8, 10].

Surprisingly, little work has been done on verifying which measures are most suitable for mining a given class of data sets. Such work is of crucial importance, since it enables us to identify similarity measures which are useful in a given context and, therefore, for which efficient algorithms should be further investigated. Moreover, an accurate evaluation of the performance of the existing algorithms is not possible without a good understanding of the data sets occurring in practice.

In this work we attempt to fill this gap by studying measures for clustering of similar stocks (see [11] for more information about mining financial data). We obtained the stock-market data for 500 stocks from the Standard & Poor (S&P) index for the year 1998. Each stock is a series of 252 numbers, representing the price of the stock at the beginning of an operational day. Every time series is assigned to one of 102 clusters (e.g., "Computers (Hardware)", "Oil and Gas", etc.). Taking this classification as the "ground truth", we try to re-create it by running a clustering algorithm on the stock data using a variety of similarity measures. The respective measures are then evaluated by comparing the resulting clustering to the original S&P classification (we use other evaluation methods as well).

Our experiments exhibited several very interesting properties of stock-market data. First of all, the best clustering results were obtained for a novel measure proposed in this paper which uses piecewise normalization. The main idea behind this approach is to split the sequences into blocks and perform a separate normalization within each block. Our results indicate that this approach is quite powerful for stock-market data. Another interesting observation is that comparing the normalized derivatives of the price sequences resulted in better clusterings than comparing the actual sequences (this phenomenon is widely known in the financial community). In particular, the combination of both of the above ideas yields the highest-quality clustering obtained in this paper, depicted in Table 5. One can observe that the majority of our clusters have a one-to-one correspondence with one of the original S&P clusters, in the sense that for each of these clusters (say C) the S&P cluster closest to C also chooses C as its closest cluster. All of the remaining clusters are not nearest neighbors of any S&P cluster.
The high quality of the clustering obtained using derivatives has very interesting implications, since the performance analysis for most time-series data structures assumes that the sequences are smooth¹, which is clearly not the case for the derivatives. Therefore, our results suggest that new algorithmic techniques should be developed to capture the scenarios in which non-smooth time series data are present.

¹E.g., [1, 7] approximate a sequence by removing all but a few elements in the Fourier representation of the sequence; the quality of the approximation in this case relies on the fact that the high-frequency components of a signal have low amplitude, which is clearly not the case for the derivative sequence.
2. SETUP DESCRIPTION
The Data. We have used the Standard and Poor 500 index (S&P) historical stock data published at http://kumo.swcp.com/stocks/ . There are approximately 500 stocks whose daily price fluctuations are recorded over the course of one year.

Each stock is a sequence of some length d, where d ≤ 252 (the latter number is the number of days in 1998 when the stock market was operational; d can be smaller if the company was removed from the index). We used only the day's opening price; the data also contains the closing price, and the low and high stock valuations for the day.

The data also contains the official S&P clustering information, which groups the different stocks into industry groups based on their primary business focus. This information was also used in our experiments, with the assumption that it provides us with a basis for a "ground truth" with which we can compare and rate the results of our unsupervised clustering algorithm. We abstracted the 102 members of this S&P clustering into 62 "superclusters" by combining closely related ones together, e.g., "Automobiles" with "Auto (Parts)" or "Computers (Software)" with "Computers (Hardware)".

Feature Selection. Our feature selection approach consists of three main steps, depicted in Figure 1:
Representation        Normalization   Dimensionality reduction
- raw data            - global        - PCA
- first derivative    - piecewise     - Aggregation
                      - none          - Fourier Transform
                                      - none

Figure 1: Feature extraction process
1. Representation choice: in this step we map the original time series into a point in d-dimensional space, where d is close to the length of the sequence. We use two types of mapping: identity and first derivative (or FD for short). In the first case, the whole sequence is considered to be one 252-dimensional point. In the second case, the i-th coordinate of the derivative vector is equal to the difference between the (i + 1)-th and the i-th value of the sequence. Both mappings are natural in the context of time-series data.
2. Normalization: in this step we decide if and how we should normalize the vectors. The standard normalization is done by computing the mean of the vector coordinates and subtracting it from all coordinates (note that in this way the mean becomes equal to 0), and then dividing the vector by its L2 norm. This step allows us to bring together stocks which follow similar trends but are valued differently, e.g., due to stock splits (note that our time series are not adjusted for splits). We also introduce a novel normalization method which we call piecewise normalization. The idea here is to split the sequence into windows, and perform the normalization (as described above) separately within each window. In this way we take into account local similarities, as opposed to the global similarity captured by the normalization of the whole vector. (Both variants are sketched in code after this list.)
3. Dimensionality reduction: in this step we aim to reduce the dimensionality of the vector space while preserving (or perhaps even improving) the quality of the representation. Our first dimensionality reduction technique is based on Principal Component Analysis (PCA). PCA maps vectors $x^n$ in a d-dimensional space $(x_1, \ldots, x_d)$ onto vectors $z^n$ in an M-dimensional space, where $M < d$. PCA finds $d$ orthonormal basis vectors $u_i$, also called principal components, and retains only a subset $M < d$ of these principal components to represent the projections of the vectors $x^n$ in the lower-dimensional space. PCA exploits the technique of Singular Value Decomposition (SVD), which finds the eigenvalues and eigenvectors of the covariance matrix
$$\Sigma = \sum_n (x^n - \bar{x})(x^n - \bar{x})^T,$$
where $\bar{x}$ is the mean of all vectors $x^n$. The principal components are shown to be the eigenvectors corresponding to the $M$ largest eigenvalues of $\Sigma$, and the input vectors are projected onto these eigenvectors to give the components of the transformed vectors $z^n$ in the M-dimensional space.

Our second technique, aggregation, is based on the assumption that the local fluctuation of a stock (say, within a period of 10 days) is not as important as its global behavior, and that we can therefore replace a 10-day period by the average stock price during that time. In particular, we split the time domain into windows of length B (for B = 5, 10, 20, etc.) and replace each window by its average value. Clearly, this decreases the dimensionality by a factor of B.

Our third technique is based on the Fourier Transform (e.g., see [1] for a description). Basically, we used truncated spectral representations, i.e., we represented a time series by only a few of its lowest frequencies. (Both reductions are sketched in code below.)
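For concreteness, the representation and normalization steps can be stated in a few lines of code. The following is a minimal sketch in Python/NumPy; the function names and the handling of a ragged final window are our choices, not taken from the original implementation:

    import numpy as np

    def first_derivative(x):
        # FD representation: the i-th coordinate is x[i+1] - x[i].
        return np.diff(x)

    def global_normalize(x):
        # Subtract the mean (so it becomes 0), then divide by the L2 norm.
        x = x - x.mean()
        norm = np.linalg.norm(x)
        return x / norm if norm > 0 else x

    def piecewise_normalize(x, window):
        # Apply the same normalization separately within each window,
        # capturing local rather than global similarity.
        return np.concatenate([global_normalize(x[i:i + window])
                               for i in range(0, len(x), window)])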
Until now we have described how we compare sequences of identical length. In order to compare a pair of sequences of different lengths, we take only the relevant portion of the longer time series, and perform the aforementioned processing only on that part. However, in order to perform PCA we need to assume that all points have the same dimension, so there we instead pad all shorter sequences with zeros.
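The aggregation and truncated-Fourier reductions, as well as the zero-padding applied before PCA, admit equally short sketches (again in NumPy, with naming of ours; an illustration rather than the authors' implementation):

    import numpy as np

    def aggregate(x, B):
        # Replace each window of length B by its average value;
        # this reduces the dimensionality by a factor of B.
        d = len(x) - len(x) % B        # drop a ragged tail, if any
        return x[:d].reshape(-1, B).mean(axis=1)

    def fourier_truncate(x, m):
        # Truncated spectral representation: keep only the m lowest
        # frequencies of the (real-input) Fourier transform.
        return np.fft.rfft(x)[:m]

    def zero_pad(x, d):
        # Pad a shorter sequence with zeros so that all points
        # fed to PCA share the same dimension d.
        return np.pad(x, (0, d - len(x)))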
Similarity measure. We use the Euclidean distance between the feature vectors.

The clustering method. We use Hierarchical Agglomerative Clustering (HAC), which involves building a hierarchical classification of objects by a series of binary mergers (agglomerations), initially of individual objects, later of clusters.
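HAC over Euclidean feature vectors is available off the shelf; the following sketch uses SciPy and cuts the dendrogram at a fixed number of clusters. The choice of SciPy and of average linkage is ours, as the extract does not specify a merge criterion:

    from scipy.cluster.hierarchy import linkage, fcluster

    def hac(features, k=62):
        # features: one row per stock, one column per (reduced) dimension.
        Z = linkage(features, method='average', metric='euclidean')
        return fcluster(Z, t=k, criterion='maxclust')  # one label per stock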
To evaluate a clustering $C = \{C_1, \ldots, C_k\}$ against the S&P clustering $C' = \{C'_1, \ldots, C'_l\}$, we measure the similarity of a pair of clusters by the formula
$$\mathrm{Sim}(C_i, C'_j) = 2\,\frac{|C_i \cap C'_j|}{|C_i| + |C'_j|}$$
and the similarity of the two clusterings by
$$\mathrm{Sim}(C, C') = \left( \sum_i \max_j \mathrm{Sim}(C_i, C'_j) \right) / k.$$
Note that this measure is not symmetric.
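The two formulas transcribe directly into code (clusters represented as sets of tickers; the helper names are ours):

    def sim_pair(ci, cj):
        # Sim(Ci, C'j) = 2 |Ci ∩ C'j| / (|Ci| + |C'j|)
        return 2 * len(ci & cj) / (len(ci) + len(cj))

    def sim_clustering(C, C_prime):
        # Sim(C, C') = (sum over Ci of its best-matching C'j) / k.
        # Swapping the arguments generally changes the value,
        # which is why the measure is not symmetric.
        return sum(max(sim_pair(ci, cj) for cj in C_prime)
                   for ci in C) / len(C)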
3. RESULTS
We started by finding the right parameters for the eigenvalue decomposition for each particular feature type.

After applying SVD on the raw data, it turned out that 97.62% of the eigenvalue weight is in the first 5 values, and 98.88% is in the first 10 values. This suggests that a projection of this 252-dimensional data into a 10- or even 5-dimensional space will result in a negligible loss of information (see Figures 2 and 3).

[Figures 2-4: eigenvalue plots]

Figure 5: First 25 eigenvalues for raw data after global normalization

Therefore, for the further dimensionality reduction experiments, we chose the dimensionality for the above scenarios to be equal to 5, 10, 50 and 100, respectively.
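The quoted eigenvalue weights can be reproduced in outline from the covariance spectrum (a sketch of ours; the 97.62% and 98.88% figures are the paper's, not recomputed here):

    import numpy as np

    def eigenvalue_weight(X, m):
        # Fraction of the total eigenvalue mass captured by the m largest
        # eigenvalues of the covariance matrix of the rows of X.
        Xc = X - X.mean(axis=0)
        w = np.linalg.eigvalsh(Xc.T @ Xc)[::-1]  # eigenvalues, descending
        return w[:m].sum() / w.sum()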
Clusters. The results of the clusterings (together with their parameters) are depicted in Tables 1, 2, 3 and 4. Table 1 was obtained by clustering the stocks according to 8 variants of the similarity metrics, from the space

{raw data, first derivative} x {global normalization, no normalization} x {all dimensions, reduced dimension}.

The number of clusters was set to 62 (i.e., equal to the number of S&P "superclusters").
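The eight variants are simply the cross product of the three binary choices; for bookkeeping they can be enumerated directly (a trivial illustration):

    from itertools import product

    variants = list(product(['raw data', 'first derivative'],
                            ['global normalization', 'no normalization'],
                            ['all dimensions', 'reduced dimension']))
    assert len(variants) == 8  # the 8 similarity-metric variants of Table 1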
The second table shows the results for dimensionality reduction via aggregation. In this case, we vary the size of the aggregation window, the data representation (raw data or FD) and the normalization (global or none).
Figure 8: Fourier Transform vs. globally normalized derivatives

Figure 9: Globally normalized raw data vs. globally normalized derivatives
The comparison in Figure 10 indicates that applying the same normalization to the whole sequence is not the best solution, and that one can obtain better results by using piecewise normalization with a carefully chosen window size. The fact that piecewise normalization behaves so well can be explained by observing that it greatly reduces the effect of "local anomalies" on the distance between two stock indices. This is due to the fact that such "anomalous" events affect only one or two windows, while the others can still adjust to each other if their behavior is similar, even if the actual valuations are very different.

Related work. There is a considerable body of work on the indexing and mining of time series. This work was started by Agrawal et al [1], who showed that indexing time sequences can be done very efficiently using the Fourier Transform. It was extended in [7] to finding similar subsequences. Another important contribution was made by Agrawal et al in [2], who also addressed the issue of sequence transformations (like scaling and translation) as well as noise resilience. Further work in this area has been done by Rafiei and Mendelzon [10], Bollobas et al [3], Das et al [4], and Huang and Yu [8].
Figure 10: Piecewise normalization vs. global normalization

INTC: Electronics (Semiconductors)
CPQ: Computers (Hardware)
IBM: Computers (Hardware)
KLAC: Equipment (Semiconductor)
AMAT: Equipment (Semiconductor)
TLAB: Communications Equipment
ADBE: Computers (Software & Services)
COMS: Computers (Networking)

Table 6: Cluster # 1
WB: Banks (Major Regional)
STI: Banks (Major Regional)
PNC: Banks (Major Regional)
NOB: Banks (Major Regional)
BKB: Banks (Major Regional)
NCC: Banks (Major Regional)
BBK: Banks (Major Regional)
WFC: Banks (Major Regional)
MEL: Banks (Major Regional)
STT: Banks (Major Regional)
FITB: Banks (Major Regional)
BAC: Banks (Money Center)
SUB: Banks (Major Regional)
ONE: Banks (Major Regional)
KEY: Banks (Major Regional)
FTU: Banks (Money Center)
USB: Banks (Major Regional)
RNB: Banks (Major Regional)
CMA: Banks (Major Regional)
MWD: Financial (Diversified)
MER: Investment Banking/Brokerage
LEH: Investment Banking/Brokerage
SCH: Investment Banking/Brokerage
AIG: Insurance (Multi-Line)
SAI: Financial (Diversified)
NTRS: Banks (Major Regional)
BT: Banks (Money Center)
JPM: Banks (Money Center)
CMB: Banks (Money Center)
CCI: Financial (Diversified)
AXP: Financial (Diversified)
BK: Banks (Major Regional)
GDT: Health Care (Medical Products & Supplies)
KRB: Consumer Finance
NSC: Railroads
CSX: Railroads
DRI: Restaurants
TAN: Retail (Computers & Electronics)
MTG: Financial (Diversified)
MAR: Lodging-Hotels
HLT: Lodging-Hotels
MMM: Manufacturing (Diversified)
PCAR: Trucks & Parts

Table 7: Cluster # 3

BR: Oil & Gas (Exploration & Production)
APA: Oil & Gas (Exploration & Production)
APC: Oil & Gas (Exploration & Production)
MRO: Oil (Domestic Integrated)
ORX: Oil & Gas (Exploration & Production)
PZL: Oil (Domestic Integrated)
KMG: Oil & Gas (Exploration & Production)
AHC: Oil (Domestic Integrated)
UCL: Oil (Domestic Integrated)
UPR: Oil & Gas (Exploration & Production)
SNT: Natural Gas
OXY: Oil (Domestic Integrated)
XON: Oil (International Integrated)
MOB: Oil (International Integrated)
CHV: Oil (International Integrated)
ARC: Oil (Domestic Integrated)
P: Oil (Domestic Integrated)
RD: Oil (International Integrated)
AN: Oil (International Integrated)
MDR: Engineering & Construction
RDC: Oil & Gas (Drilling & Equipment)
SLB: Oil & Gas (Drilling & Equipment)
HAL: Oil & Gas (Drilling & Equipment)
BHI: Oil & Gas (Drilling & Equipment)
HP: Oil & Gas (Drilling & Equipment)

Table 8: Cluster # 8
F: Automobiles
CAT: Machinery (Diversified)
GM: Automobiles
DAL: Airlines
AMR: Airlines
U: Airlines
LUV: Airlines
CYM: Metals Mining
PD: Metals Mining
AR: Metals Mining
RLM: Aluminum
AA: Aluminum
AL: Aluminum
N: Metals Mining
CS: Computers (Networking)
ALT: Iron & Steel
SIAL: Chemicals (Specialty)

Table 9: Cluster # 22

CPL: Electric Companies
FPL: Electric Companies
DUK: Electric Companies
AEE: Electric Companies
PEG: Electric Companies
HOU: Electric Companies
UCM: Electric Companies
EIX: Electric Companies
PCG: Electric Companies
CIN: Electric Companies
BGE: Electric Companies
NSP: Electric Companies
AEP: Electric Companies
ED: Electric Companies
SO: Electric Companies
DTE: Electric Companies
FE: Electric Companies
D: Electric Companies
CSR: Electric Companies
GPU: Electric Companies
TXU: Electric Companies
PPL: Electric Companies
ETR: Electric Companies
PE: Electric Companies