3D Subspace Clustering for Value Investing
Kelvin Sim, Institute for Infocomm Research
A 3D-subspace-clustering method generates rules to pick potential undervalued stocks; 3D subspace clustering is effective in handling high-dimensional financial data and is adaptive to new data.

Value investing is an investment strategy where the investor believes that a stock’s fundamentals determine future stock prices.1 A value investor analyzes stock fundamentals and buys stocks that are undervalued with the belief that the prices of the stocks will rise in the future.2 The success of value investing is evident in the stock market, with many famous value investors’ portfolios, such as Warren Buffett’s, outperforming the market indices. There are also many successful mutual funds that follow the philosophy of value investing, such as Third Avenue Value Fund, which managed 5.04 billion dollars of assets in 2011. Academic research has shown that stock fundamentals are related to stock prices.3

Stock fundamentals can be measured by stock financial ratios. For example, the return-on-equity ratio measures a stock’s efficiency in using its assets to generate profit, the debt–equity ratio measures the amount of the stock’s assets that are debts, and the price–earnings ratio measures the ratio of the stock’s current price to its current earnings.

Therefore, scrutinizing financial ratios is important in finding undervalued stocks.2 However, there’s no perfect rule that shows which financial ratios and what values of these ratios are related to undervalued stocks. For example, Benjamin Graham, the founder of value investment, prefers stocks with a price–earnings ratio of no more than 7.2 Using Graham’s rules on picking stocks has been proven to generate profits for value investors. An experiment conducted over an eight-year period from 1973 to 1980 showed this strategy to be profitable.1

We propose using 3D subspace clustering to generate rules to pick potential undervalued stocks. The 3D subspace-clustering method is effective in handling high-dimensional financial data and is adaptive to new data. In addition, its results aren’t influenced by human biases and emotions, and are easily interpretable. We conducted extensive experimentation in the stock market over a period of 28 years (from 1980 to 2007), and we found that using rules
knowledge. These rules are generally used to select a comfortable number of stocks for the investor to conduct further analysis. Hence, these rules provide some general decision support. For an inexperienced investor, manually setting rules on the financial ratios can be difficult, and even for the experienced investor, he or she might be prone to set irrational and biased rules. The investor can stick to Graham’s rules, but the relevance of these rules at present time remains to be seen.

Hence, the following problem needs to be addressed: How do we find rules on financial ratios that are related to high stock price returns? We should note here that we define the price return of a stock as (sold price – purchased price)/purchased price.

There are financial studies that investigate the impact of single financial ratios on stock prices.3 However, different financial ratios quantify different aspects of a stock, so to get the complete picture, it will be useful to study the collective influence of financial ratios on the stock prices, and this is a nontrivial problem.

Figure 1. (a) Example of a 3D financial dataset defined by stocks, financial ratios, and years. The highlighted region is an actionable 3D subspace cluster of stocks s2, s3, s4 that have similar financial fundamentals reflected in financial ratios r2, r3, r4. (b) The price returns of the stocks. Stocks s2, s3, s4 have high price returns.

Proposed Solution
We propose using 3D subspace-clustering algorithms to mine rules that are related to high stock price returns. The 3D subspace-clustering approach groups stocks that have similar fundamentals (financial ratios) and high price returns across years. The highlighted region in Figure 1a is a 3D subspace cluster containing stocks s2, s3, s4 that have similar fundamentals reflected in financial ratios r2, r3, r4 for years 1–3, 5–6, and 8–10. From Figure 1b, we can see that stocks s2, s3, s4 have high price returns.

This cluster’s subspace can be used as a rule that’s related to high price returns. For future years, if there’s a stock whose financial ratio values fall in this subspace, we can consider this stock as a potential undervalued stock. Using the example in Figure 1a, the rule is r2[2, 3], r3[10, 11], r4[5, 6] in year 1, …, r2[…], r3[…], r4[0, 2] in year 10, where r[j, k] denotes that the stock’s value on financial ratio r should fall between values j and k. If there’s a stock whose set of future years contain this rule, this stock is a potentially undervalued stock.

3D subspace clustering is suitable for this value investing problem, due to the following reasons:

• Effective in handling financial ratio data. Financial ratio data is high dimensional, as the number of financial ratios and timestamps can be large. Techniques such as traditional clustering (for example, k-means clustering) suffer from the curse of dimensionality in this type of data; the stocks are equidistant from each
other in the full space of the data, hence it’s difficult to cluster them.4 We developed 3D subspace clustering to overcome this curse of dimensionality. We achieve this by clustering stocks based on similar subsets of financial ratios (data subspace). The financial ratio data is continuous and 3D subspace clustering is generally used on this type of data.
• Adaptive to new data. The financial ratio data is constantly changing and 3D subspace clustering can be easily reapplied on the new data to get the updated results.
• Easy interpretation of results. The investor can easily analyze 3D subspace clusters because the clusters are explicitly created.

We aren’t trying to solve the problem of how to invest, for example, or what stocks to buy at a particular time. Instead, we’re trying to determine if 3D subspace clustering can help the value investor in his or her stock selection process by decreasing the pool of stocks to select.

To evaluate the effectiveness of using 3D subspace clustering for value investing, we compare its profits and risks to those of Graham’s rule-based strategy.

Preliminaries
Let the 3D financial dataset be a cuboid D with its axes defined by objects (stocks) O, attributes (financial ratios) A, and time stamps (years) T, that is, D = O × A × T.

Let the value of financial ratio a of stock o, in year t, be denoted as $v_{oat}$. Let p(o, t) be the closing price of stock o at the end of the fiscal year t, which we use as the buying and selling prices in our experiments. The price return of stock o, bought at year t and sold at year t + i, is calculated as

$$ret(o, t, t+i) = \frac{p(o, t+i) - p(o, t)}{p(o, t)},$$

which is also known as the i-period simple net return.5

For the sake of brevity, we also denote ret(o) as the price return of stock o, if the year it’s bought and the year it’s sold aren’t required to be explicitly stated.

A 3D subspace cluster is a subcuboid C = O × A × T, with its axes defined by a subset of the stocks, a subset of the financial ratios, and a subset of the years of the dataset. We denote $\{C_1, \ldots, C_m\}$ as the set of 3D subspace clusters mined from the dataset D.

Research Design
We present the research design of our experiments. The research design consists of three main phases: data preparation, stock picking, and data analysis.

Data Preparation
In the data preparation phase, we obtained raw financial figures of US stocks from Compustat (see www.compustat.com) and converted this information into a 3D dataset of financial ratios. We removed microcap stocks (whose prices are less than $5) from the data, as these stocks have a high risk of being manipulated and their financial figures are less transparent.

We converted the raw financial figures into 30 financial ratios, based on the ratios’ formulas from Investopedia (see www.investopedia.com/university/ratios).

We prepared a financial dataset D containing 30 financial ratios and spanning 28 years (from 1980 to 2007). The number of stocks increased from 3,335 to 5,049, due to the stock market’s expansion. Some of the dataset (14.7 percent) contains missing values.

Graham’s rule-based strategy uses 10 years of financial ratio data to pick stocks, and to have a fair comparison, we also partition the financial ratio data into 10-year datasets, and use them as training data for the 3D-subspace-clustering algorithms to pick stocks. The partitioned datasets are denoted as $D_t$, t ∈ {1980, …, 1999}, with each $D_t$ containing data of the set of years T = {t, …, t + 9}. For example, $D_{1980}$ contains data from 1980 to 1989.

We also processed these 10-year datasets $D_t$ to contain only stocks that have high price returns. These datasets are required as training data for certain 3D-subspace-clustering algorithms. More specifically, $D^{t}_{minret}$ is a processed dataset that contains stocks o whose CAGR(o, t, t + 9) ≥ minret, given that minret is a threshold. The compound annual growth rate is

$$CAGR(o, t, t+9) = \left(\frac{p(o, t+9)}{p(o, t)}\right)^{1/9} - 1.$$

We use compound annual growth rate instead of average return to minimize the effect of volatility of periodic returns.

We vary minret from 0.1 to 0.5, as there are no valid stocks in some 10-year datasets $D^{t}_{minret}$ for minret > 0.5.

Stock Picking
Graham’s rule-based strategy consists of a buy phase and a sell phase.1 In the buy phase, a stock is bought if it satisfies at least one reward criterion and one risk criterion. The criteria are shown in Table 1. If a stock satisfies at least one reward criterion and one risk criterion on year t, we will purchase the stock on the last day of its fiscal year t.

In the sell phase, the stock will be sold either on the last day of its fiscal year t + 2, or on the day when its price appreciates by 50 percent, whichever comes first. We slightly tweak the sell phase, as we are only able to obtain the price of the last day of each fiscal year of a stock. A stock will be sold on the last day of its fiscal year t + 1 if its price appreciates by more than 50 percent, or it will
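The return and growth-rate definitions from the Preliminaries and Data Preparation sections can be illustrated with a minimal Python sketch. It is not the authors' code (their experiments were run in C++); the `prices` dictionary layout, the function names, and the choice to skip stocks with a missing price are assumptions made only for illustration.

```python
from typing import Dict, List

# Assumed layout: prices[stock][year] is the fiscal-year-end closing price p(o, t).
Prices = Dict[str, Dict[int, float]]


def price_return(prices: Prices, o: str, t: int, i: int) -> float:
    """i-period simple net return ret(o, t, t + i)."""
    return (prices[o][t + i] - prices[o][t]) / prices[o][t]


def cagr(prices: Prices, o: str, t: int, years: int = 9) -> float:
    """Compound annual growth rate CAGR(o, t, t + years)."""
    return (prices[o][t + years] / prices[o][t]) ** (1.0 / years) - 1.0


def window_years(t: int, length: int = 10) -> List[int]:
    """Years covered by the 10-year dataset D_t, that is, {t, ..., t + 9}."""
    return list(range(t, t + length))


def high_return_stocks(prices: Prices, t: int, minret: float) -> List[str]:
    """Stocks kept in D^t_minret: those with CAGR(o, t, t + 9) >= minret.

    Assumption: stocks lacking a price in year t or t + 9 are skipped.
    """
    return sorted(
        o
        for o, p in prices.items()
        if t in p and t + 9 in p and cagr(prices, o, t) >= minret
    )


# Toy, made-up prices: stock "A" grows 20 percent a year, stock "B" barely moves.
toy_prices: Prices = {
    "A": {1980: 10.0, 1989: 10.0 * 1.2 ** 9},
    "B": {1980: 10.0, 1989: 11.0},
}
selected = high_return_stocks(toy_prices, 1980, minret=0.1)  # only "A" passes
```

Using the compound rate in the filter means a stock qualifies on its start and end prices alone, which matches the stated aim of reducing the effect of volatile year-to-year returns.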
… if, ∀a ∈ A: $v_{o'at'_1} \in boundary(V_{at_1}), \ldots, v_{o'at'_n} \in boundary(V_{at_n})$ … the same sell phase described for Graham’s rule-based strategy in the previous “Stock Picking Phase” section for the 3D subspace-clustering strategy.

The strategy’s risk on training dataset D is its standard deviation of the average return. A high standard deviation implies that the strategy is risky and volatile. Let $\delta_{ret}$ denote the risk-free return that the investor is assumed to have. In calculating the standard deviation, we shouldn’t incorporate returns that are at least $\delta_{ret}$. Thus, we calculate the strategy’s risk on training dataset D using the downside standard deviation, which is Definition 4 (the risk of strategy):

$$risk^{D}_{strat} = \sqrt{\frac{\sum_{o \in O^{D}_{strat} \mid ret(o) < \delta_{ret}} \left(ret(o) - ret^{D}_{strat}\right)^{2}}{\left|\{o \mid o \in O^{D}_{strat} \wedge ret(o) < \delta_{ret}\}\right| - 1}}.$$

A strategy is thus desirable if it gives high average return and low downside risk (standard deviation), which can be measured using the Sortino ratio.6 Definition 5 is the Sortino ratio of strategy:

$$SortinoRatio^{D}_{strat} = \frac{ret^{D}_{strat} - \delta_{ret}}{risk^{D}_{strat}}.$$

We conduct different stock-picking strategies and evaluate their results by the following experiments:

• Average returns across years. We denote T as the set of years used to test a strat. For each testing dataset $D_t$, t ∈ T, we calculate the average return $ret^{D_t}_{strat}$ and $SortinoRatio^{D_t}_{strat}$ of the stocks bought.
• Overall average returns and risks. We calculate the overall average return of a strat by averaging $ret^{D_t}_{strat}$, ∀t ∈ T, and we calculate the overall risk of this strategy by averaging the downside standard deviation of $ret^{D_t}_{strat}$, ∀t ∈ T. We also calculate the overall Sortino ratios of the strategies using the overall average return and risk.

3D Subspace-Clustering Algorithms
We use a wide range of 3D subspace-clustering algorithms in our experiments.

TRICLUSTER
TRICLUSTER is the pioneer algorithm for mining 3D subspace clusters, which are denoted as triclusters.7 A tricluster can be transformed into a wide variation of 3D subspace clusters, depending on the setting of the TRICLUSTER algorithm’s parameters.7 In a tricluster C = O × A × T, the stocks O have homogeneous values in the set of financial ratios A in each year t ∈ T, and the homogeneity and size criteria are satisfied subject to the setting of parameters δ, minO, minA, minT. We also set its parameters $\delta_y = \delta_z = \infty$, as they aren’t applicable in mining our desired clusters. The clusters are sensitive to the parameters, and careful setting of the parameters is required.

STATPC
Moise and Sander proposed statistically significant subspace clusters (SSSCs), which are subspace clusters that are insensitive to the parameters of their algorithm STATPC.8 The number of stocks in a statistically significant cluster is significantly more than expected, under the assumption that the data is uniformly distributed.

SSSCs are 2D subspace clusters O × A, thus we require a postprocessing step to convert the 2D SSSCs to 3D. Given a dataset D, which contains a set of time stamps T, we mine SSSCs from each year t ∈ T, and we try all possible combinations of them to obtain 3D SSSCs. That is, a 3D SSSC C = O × A × T is formed if there exists a 2D SSSC O × A, ∀t ∈ T.

MIC
Correlated subspace clusters (CSCs) are insensitive to the parameters of their algorithm, MIC.9 Unlike SSSCs, CSCs are 3D and they don’t require the assumption of uniformly distributed data.

A 3D subspace cluster is a CSC when it satisfies the following criterion: the 3D subspace cluster C = O × A × T is correlated when the values in the cluster have high co-occurrences and these co-occurrences aren’t by chance.

CATSeeker
The price return of the stocks can be crucial information in clustering, but the previous three algorithms don’t incorporate this information. The CATSeeker algorithm incorporates this information, and its clusters are denoted as CATSs.10 A CATS satisfies the following criterion: the 3D subspace cluster C = O × A × T is actionable when, ∀t ∈ T, the stocks in O are similar on the set of financial ratios A, and the stocks in O have high and correlated price returns in years T. Given a set of centroids, the optimal clusters with respect to these centroids are found. The results of the algorithm are shown to be insensitive to its parameters.10

On the selection of centroids, the user can set a threshold to select stocks that have good historical price returns as the centroids.

Experiments
We coded all algorithms in C++, and their codes or programs were kindly provided by their respective authors. We performed all experiments using computers with Intel Core 2 Quad 3.0-GHz CPUs with 8 Gbytes of RAM. We used Windows 7 except for experiments involving TRICLUSTER, which we performed in Ubuntu 10.10.

We conducted the experiments in accordance with the parameters presented in the “Research Design” section. We set the risk-free return at $\delta_{ret} = 0$ in our experiments.

For TRICLUSTER, we fixed its minimum size parameters to minO = 5, minA = 2, and minT = 3, and varied its similarity parameters as ε = 1 and δ = 0, 0.1, 0.01, as it’s not possible to test on all possible combinations of its parameters.
Figure 2. The different 3D-subspace-clustering strategies’ (a) average returns and (b) Sortino ratios across the years. Each year on the x-axis denotes the start of a ten-year test period. (c) Percentage of stocks bought. The legend lists the Graham, TRICLUSTER (TRI), STATPC, MIC, and CATSeeker (CATS) strategies.
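The postprocessing step described for STATPC, which lifts per-year 2D clusters O × A to 3D clusters O × A × T, can be sketched as follows. This is a deliberately simplified reading and not the authors' procedure: it assumes a 2D cluster counts as "the same" across years only when its stock set and ratio set match exactly, and the `min_years` parameter and all names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

Cluster2D = Tuple[Iterable[str], Iterable[str]]                   # (stocks O, ratios A)
Cluster3D = Tuple[FrozenSet[str], FrozenSet[str], FrozenSet[int]]  # (O, A, years T)


def lift_to_3d(
    clusters_by_year: Dict[int, List[Cluster2D]], min_years: int = 1
) -> List[Cluster3D]:
    """Form O x A x T from 2D clusters O x A that recur in every year of T.

    Assumption: exact matching of O and A across years; min_years is a
    hypothetical lower bound on |T|.
    """
    years_of: Dict[Tuple[FrozenSet[str], FrozenSet[str]], Set[int]] = defaultdict(set)
    for year, clusters in clusters_by_year.items():
        for stocks, ratios in clusters:
            years_of[(frozenset(stocks), frozenset(ratios))].add(year)
    return [
        (stocks, ratios, frozenset(years))
        for (stocks, ratios), years in years_of.items()
        if len(years) >= min_years
    ]


# Made-up per-year clusters: the {s2, s3} x {r2, r3} cluster appears in both years.
toy_clusters = {
    1: [({"s2", "s3"}, {"r2", "r3"})],
    2: [({"s2", "s3"}, {"r2", "r3"}), ({"s1"}, {"r1"})],
}
lifted = lift_to_3d(toy_clusters, min_years=2)
```

Grouping by the (O, A) pair keeps the sketch linear in the number of mined 2D clusters; trying looser combinations, as the text describes, would need a more elaborate search.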
For STATPC, we used its default setting $\alpha_0 = 10^{-10}$, $\alpha_K = \alpha_H = 10^{-3}$. For MIC, we used its default setting p-value = $10^{-4}$. For CATSeeker, we used its default setting τ = 0.1; m = 10; δ = 0.001; λ = 0.1; and varied ρ = 0.2, 0.3, 0.4, as its results are shown to be insensitive to this range of ρ.10

On the use of training datasets, TRICLUSTER, STATPC, and MIC used $D^{t}_{minret}$, as these algorithms don’t consider the stocks’ price returns during their clustering process. CATSeeker used $D_t$, as this algorithm considers the stocks’ price returns during its clustering process. For CATSeeker, a stock o is selected as a centroid if its CAGR(o, t, t + 9) is at least minret.

Average Returns Across Years
Figures 2a and 2b present the average returns and Sortino ratios of the different 3D subspace-clustering strategies, on testing datasets $D_t$, t ∈ {1989, …, 1998}.

The TRICLUSTER-based strategy generated good positive returns in the initial seven datasets. In dataset $D_{1992}$, it even has a Sortino ratio of infinity, as all the stocks picked have positive returns. However, this approach generated substantial losses in the last two datasets, notably generating an 80 percent loss in $D_{1998}$. Hence, TRICLUSTER produced pretty volatile results. A strategy with volatile results naturally generates high returns in the datasets when it’s profitable; the TRICLUSTER-based strategy has the highest Sortino ratio in datasets $D_{1991}$, $D_{1993}$, and $D_{1994}$. However, strategies with less volatile results also outperformed their volatile peer in certain years, as the CATSeeker-based strategy has the highest Sortino ratio in $D_{1996}$,
across the 10 testing datasets and present the results in Figure 3a. STATPC-, MIC-, and CATSeeker-based strategies have zero risk, as they have positive average returns across the 10 testing datasets. Figure 3b presents the overall Sortino ratio across the 10 testing datasets, which shows that Graham’s strategy has a high Sortino ratio. However, STATPC-, MIC-, and CATSeeker-based strategies have higher Sortino ratios than Graham’s strategy. Among these 3D-subspace-clustering strategies, MIC- and CATSeeker-based strategies have higher average return and lower risk than Graham’s strategy.

The Authors

Kelvin Sim is a scientist at the Data Analytics Department, Institute for Infocomm Research, Singapore, which is part of the Agency for Science, Technology, and Research. His research interests include financial data mining, subspace clustering, graph mining, co-clustering, and activities of daily living recognition. Sim has a PhD in computer engineering from Nanyang Technological University, Singapore. Contact him at shsim@i2r.a-star.edu.sg.

Vivekanand Gopalkrishnan is the director of research at Deloitte Analytics Institute Asia. His research interests include efficient algorithms for mining interesting item sets, subspace clustering, mining in P2P networks, outlier detection, and data warehousing. Gopalkrishnan has a PhD in computer science (data warehousing) from City University of Hong Kong. Contact him at vivek@deloitte.com.

Clifton Phua is the security and fraud analytics lead at SAS. His research interests include data mining, fraud detection, activity recognition, and intelligent monitoring. Clifton has a PhD in information technology from Monash University, Australia. He’s a member of IEEE. Contact him at clifton.phua@sas.com.

Gao Cong is an assistant professor at Nanyang Technological University, Singapore. His research interests include geospatial keyword queries and mining social media. Cong has a PhD in computer science from the National University of Singapore. Contact him at gaocong@ntu.edu.sg.