
Qual Quant (2010) 44:807–815
DOI 10.1007/s11135-009-9240-0

RESEARCH NOTE

Using K-means method and spectral clustering technique in an outfitter's value analysis
En-Chi Chang · Shian-Chang Huang · Hsin-Hung Wu

Published online: 13 May 2009
© Springer Science+Business Media B.V. 2009

Abstract This study applies the K-means method and the spectral clustering technique to the customer data of an outfitter in Taipei City, Taiwan. The data set contains the transaction records of 551 customers from April 2004 to March 2006. The two clustering techniques differ significantly: the K-means method is better suited to linearly separable input, whereas the spectral clustering technique may have the advantage on input that is not linearly separable. It is therefore of interest to know which technique performs better in a real-world case of evaluating customer value when the type of input space is unknown. Using cluster quality assessment, this study finds that the spectral clustering technique performs better than the K-means method. To summarize the analysis, this study also suggests marketing strategies for each cluster based on the results generated by the spectral clustering technique.

Keywords K-means method · Spectral clustering technique · Cluster quality assessment · Marketing strategy · Customer value

1 Introduction

The outfitter industry in Taiwan is becoming more competitive, and profits are consequently decreasing. It is important for outfitters to create a well-managed customer database, to identify customers with high value or profit potential, and then to customize marketing strategies for these customers. By allocating and utilizing resources effectively and efficiently, outfitters can both fulfill different customers' needs and achieve better customer retention and profitability. As the transaction record of a company becomes much larger in size,

E.-C. Chang
Manchester Business School, Booth Street West, Manchester, M15 6PB, UK

H.-H. Wu · S.-C. Huang (corresponding author)
Department of Business Administration, National Changhua University of Education, No. 2 Shida Road, Changhua City, Changhua, 500 Taiwan
e-mail: shhuang@cc.ncue.edu.tw


E.-C. Chang et al.

data mining techniques, particularly clustering techniques, can be applied to divide customers into an appropriate number of clusters based on similarities among them. The values of different customer groups can then be calculated and evaluated to provide useful decision-making information for management. Subsequently, customized marketing strategies can be devised to meet the needs of different types of customers.

In this study, two types of clustering techniques are applied. The K-means method converges quickly in terms of execution time and has been applied to large data sets, but it cannot separate clusters that are non-linearly separable in input space (Dhillon et al. 2004). The spectral clustering technique overcomes this major drawback by using the eigenvectors of an affinity matrix to obtain a clustering of the data that minimizes the normalized cut (Shi and Malik 2000). The two methods thus deal with input space differently: the K-means method is better suited to linearly separable input, while the spectral clustering technique may have the advantage on input that is not linearly separable. It is therefore of interest to know which clustering technique performs better in a real-world case study when the type of input space is unknown. Moreover, unlike existing data mining papers in the marketing literature, which limit their content mainly to technical discussion, this study provides practical marketing strategies for the outfitter after generating the technical results.

This paper is organized as follows. Section 2 reviews the K-means method and the spectral clustering technique. Customer value and recency-frequency-monetary (RFM) metrics are briefly reviewed in Sect. 3. A case study of an outfitter in Taipei City, Taiwan is analyzed and summarized in Sect. 4. Marketing implications, including promotion strategies for different clusters, are presented in Sect. 5. Finally, conclusions are drawn in Sect. 6.

2 Review of K-means method and spectral clustering technique

The K-means method is very sensitive to initial seed selection and, even in the best case, can produce only hyperspherical clusters (Jain et al. 1999; Dhillon et al. 2004). The spectral clustering technique uses the eigenvectors of an affinity matrix to obtain a clustering of the data, where a popular objective function is to minimize the normalized cut (Shi and Malik 2000). Tao and Michel (2003) have pointed out that the K-means method considers the compactness of the data, while the spectral clustering technique considers the connectedness of the data. In fact, the K-means method is more suitable for linearly separable clusters in input space, whereas the spectral clustering technique is better for clusters that are not linearly separable.

2.1 K-means method

The K-means method, a very popular non-hierarchical approach to data clustering owing to its simplicity of implementation and fast execution, uses Euclidean distance to measure the distance between two points (Yoon and Hwang 1995; Wu et al. 2008). It has two major steps (Wu et al. 2008). The first is the assignment step, in which each instance is placed in the closest class. The second is the re-estimation step, in which the class centroids are recalculated from the instances assigned to each class. Kuo et al. (2002) have pointed out that the K-means method can achieve higher accuracy if the starting point and the number of clusters are provided. However, the K-means method cannot determine the number of clusters by itself and may select the starting point and the number of clusters randomly. Under such circumstances, Kuo et al. (2002) have proposed a modified two-stage method that uses self-organizing feature maps (SOFM) to decide the number of clusters for the K-means method. Therefore, in this study, SOFM is applied to determine the number of clusters for the K-means method.
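The two steps of the K-means method described above can be illustrated with a minimal Python sketch. It is not the implementation used in this study; the function name `kmeans`, the optional `init` argument and the toy data are assumptions made purely for illustration, and NumPy is assumed to be available.

```python
import numpy as np

def kmeans(points, k, init=None, n_iter=100, seed=0):
    """Illustrative K-means with Euclidean distance (hypothetical helper)."""
    if init is None:
        # Random starting points drawn from the data; the result depends on them.
        rng = np.random.default_rng(seed)
        init = points[rng.choice(len(points), size=k, replace=False)]
    centroids = np.array(init, dtype=float)
    for _ in range(n_iter):
        # Assignment step: place each instance in the closest class.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Re-estimation step: recompute each centroid from its assigned instances.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy data (made up): two compact, well-separated groups of 20 points each.
rng = np.random.default_rng(1)
toy = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
# Starting points are supplied explicitly, one taken from each group.
labels, centroids = kmeans(toy, k=2, init=toy[[0, -1]])
```

Supplying `init` mirrors the observation that the method behaves better when the starting points and the number of clusters are provided; with `init=None` the sketch falls back to random seeds.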


Using clustering techniques in value analysis


2.2 Spectral clustering technique

Given a set of data points, the similarity matrix may be defined as a matrix $S$ where $S_{ij}$ represents a measure of the similarity between points $i$ and $j$. Spectral clustering techniques make use of the spectrum of the similarity matrix of the data to cluster the points; the spectrum is also used to perform dimensionality reduction so that clustering takes place in fewer dimensions (Speer et al. 2005; Chang et al. 2007). Consider a set of points $S = \{s_1, s_2, s_3, \ldots, s_n\}$ in $\mathbb{R}^t$. If $k$ subsets are to be clustered, the six major steps are as follows (Speer et al. 2005; Chang et al. 2007).

1. Construct the affinity matrix $A \in \mathbb{R}^{n \times n}$ defined by $A_{ij} = \exp\left(-\|s_i - s_j\|^2 / (2\sigma^2)\right)$ if $i \neq j$, and $A_{ii} = 0$.
2. Define $D$ as the diagonal matrix whose $(i, i)$-element is the sum of the $i$-th row of $A$, and form the matrix $L = D^{-1/2} A D^{-1/2}$.
3. Find the eigenvectors of $L$ associated with its $k$ largest eigenvalues (chosen to be orthogonal to each other in the case of repeated eigenvalues), denoted by $x_1, x_2, x_3, \ldots, x_k$, and form the matrix $X = [x_1\ x_2\ x_3\ \ldots\ x_k] \in \mathbb{R}^{n \times k}$ by stacking the eigenvectors in columns.
4. Form the matrix $Y$ by normalizing each row of $X$ to have unit length, i.e., $Y_{ij} = X_{ij} / \left(\sum_j X_{ij}^2\right)^{1/2}$.
5. Treat each row of $Y$ as a point in $\mathbb{R}^k$ and then cluster the data into $k$ clusters via any algorithm, such as the K-means method, that attempts to minimize distortion.
6. Assign the original point $s_i$ to cluster $j$ if and only if row $i$ of the matrix $Y$ was assigned to cluster $j$.

Speer et al. (2005) have concluded that the spectral clustering technique is easy to apply to any type of data and gives fast segmentation results without any human interaction. In this study, the Spider toolbox for MATLAB (version 1.71) is used; its source code is directly available from http://www.kyb.tuebingen.mpg.de/bs/people/spider/main.html.
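The six steps can be sketched in Python with NumPy. This is an illustrative reading of the steps, not the Spider code used in the study. The function names, the choice of `sigma`, the deterministic farthest-point seeding of the final K-means stage and the two-ring toy data are all assumptions introduced for the example.

```python
import numpy as np

def spectral_embedding(points, k, sigma):
    """Steps 1-4: affinity matrix, normalisation, top-k eigenvectors, row scaling."""
    sq = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=2)
    A = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)                       # A_ii = 0
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))      # diagonal of D^{-1/2}
    L = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L)                 # eigenvalues in ascending order
    X = eigvecs[:, -k:]                            # eigenvectors of the k largest
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def kmeans_rows(Y, k, n_iter=50):
    """Step 5: K-means on the rows of Y (farthest-point seeding is an assumption)."""
    centroids = [Y[0]]
    for _ in range(1, k):
        nearest = np.min([np.linalg.norm(Y - c, axis=1) for c in centroids], axis=0)
        centroids.append(Y[nearest.argmax()])
    centroids = np.array(centroids)
    for _ in range(n_iter):
        dists = np.linalg.norm(Y[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        centroids = np.array([
            Y[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels

def spectral_clustering(points, k, sigma=1.0):
    """Step 6: point s_i receives the cluster assigned to row i of Y."""
    return kmeans_rows(spectral_embedding(points, k, sigma), k)

def ring(radius, n):
    """Hypothetical toy generator: n evenly spaced points on a circle."""
    t = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    return np.column_stack([radius * np.cos(t), radius * np.sin(t)])

# Two concentric rings: connected but not linearly separable in input space.
rings = np.vstack([ring(1.0, 40), ring(4.0, 80)])
ring_labels = spectral_clustering(rings, k=2, sigma=0.5)
```

On this toy input the inner and outer rings end up in different clusters, which is the kind of structure that a centroid-based method applied directly in input space cannot recover. The outcome is sensitive to `sigma`, which has to be chosen to suit the scale of the data.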

3 Customer value and recency-frequency-monetary

3.1 Customer value

This study aims to find the most valuable customers for an outfitter and defines customer value as the economic value of the customer relationship to the firm (Kumer and Reinartz 2006). By incorporating this concept into marketing strategy planning, a company can optimize its use of marketing resources. Thus, the calculation of customer value is crucial to customer relationship management. When calculating customer value, companies and academics often use profitability indices such as contribution margins, net profits or customer lifetime value (Kumer and Reinartz 2006), as well as the RFM (recency, frequency, and monetary) metrics proposed by Stone (1984), to understand customers' financial potential in direct marketing (Sheth et al. 2000; Joia and Sanz 2006). Though RFM metrics have been criticized for their oversimplicity and low statistical significance (Yang 2004), they are still widely used and discussed in the marketing literature. Since the main purpose of this study is to verify the applicability of alternative clustering techniques, the promotion strategies and tactics discussed in Sect. 5 are still based on RFM metrics.


3.2 Recency-frequency-monetary

Recency means the time elapsed since the most recent purchase; frequency refers to the number of purchases made during the investigated period; and monetary means the total amount of money spent on all purchases (Viaene et al. 2001). A company can sort its customers into four groups, namely strangers, butterflies, barnacles and true friends, based on their profitability (i.e., monetary value) and longevity (i.e., recency and frequency), and then devise different strategies for each group (Reinartz and Kumer 2002). Thompson (2000) provides a similar classification based on RFM and groups a company's customers into uncertain customers, spenders, frequent customers, and best customers, each of which needs different marketing attention. Based on these two matrices and the related loyalty strategies, this study analyzes the tables in Sect. 5 and suggests promotion strategies for different clusters.
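As an illustration of the three RFM quantities defined above, the following Python sketch derives them from a list of transaction records. The record layout `(customer_id, purchase_date, amount)`, the function name `rfm` and the sample transactions are hypothetical; they are not taken from the outfitter's database.

```python
from collections import defaultdict
from datetime import date

def rfm(transactions, as_of):
    """Compute recency (days), frequency and monetary value per customer.

    transactions: iterable of (customer_id, purchase_date, amount) tuples
    as_of: reference date from which recency is measured
    """
    last_purchase = {}
    frequency = defaultdict(int)
    monetary = defaultdict(float)
    for customer, day, amount in transactions:
        # Keep the most recent purchase date seen for this customer.
        last_purchase[customer] = max(last_purchase.get(customer, day), day)
        frequency[customer] += 1          # number of purchases in the period
        monetary[customer] += amount      # total amount spent on all purchases
    return {
        customer: {
            "recency_days": (as_of - last_purchase[customer]).days,
            "frequency": frequency[customer],
            "monetary": monetary[customer],
        }
        for customer in last_purchase
    }

# Made-up transactions for two customers.
sample = [
    ("A", date(2006, 3, 1), 4000.0),
    ("A", date(2005, 12, 15), 2500.0),
    ("B", date(2004, 6, 10), 60000.0),
]
scores = rfm(sample, as_of=date(2006, 3, 31))
```

In this made-up sample, customer A is a recent and repeat buyer with modest spending, while customer B made a single large purchase long ago; this is the kind of contrast that the profitability and longevity groupings above are meant to capture.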

4 A case study

A case study is conducted in which the K-means method and the spectral clustering technique are applied to a data set from an outfitter in Taipei City, Taiwan. The data set consists of 551 member customers who shopped between April 2004 and March 2006. The profile of each customer includes a membership number, gender, birth date, zip code, all transactions and the total spending at the store. To determine the best number of clusters for the K-means method and the spectral clustering technique, this study uses SOFM, as recommended by Kuo et al. (2002), which generates six clusters. Therefore, the number of clusters in this study is set to six.

The notation and data classifications used in the analyses are as follows. Male and female are represented by 1 and 2, respectively. The birth date is classified into seven groups. In addition, zip codes are simplified into five regions, i.e., northern, central, southern, eastern, and islet regions. The average amount of spending per visit for each customer and the distribution of the average total spending are both classified into nine levels. Table 1 summarizes, for both methods, the sample size of each cluster, the cluster's average shopping frequency, average total spending, average spending per visit, and average gender. Further analyses were conducted to capture more detailed information (Tables 2 to 6), such as the distribution of age groups, the average spending per visit, the distribution of members from different regions, and the average total spending for both methods.

Each clustering technique has its own strengths in data clustering, so measurement tools are required to evaluate which technique performs better for this outfitter when planning marketing strategies. Draghici (2003) describes two types of cluster quality assessment.

The first approach compares the size of the clusters with the distance to the nearest cluster by $\min(D_{ij}) / \max(d_i)$, where $D_{ij}$ is the distance between cluster $i$ and cluster $j$, both $i$ and $j$ run from 1 to 6, and $d_i$ is the diameter of cluster $i$. If the inter-cluster distance is much larger than the size of the clusters, the clustering is considered to be more trustworthy. In this case, the value for the spectral clustering technique (0.130594) is better than that for the K-means method (0.080898).

The second approach uses the philosophy of the sample standard deviation (SD) to compute the average of the distances between the members of a cluster and the cluster center by $s = \sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 / (n - 1)}$, where $n$ is the sample size of each cluster, and $x_i$ and $\bar{x}$ are the observation and the sample average in a particular cluster, respectively. Because the clusters have different sample sizes, the average sample standard deviation for each clustering method


Table 1 Summary of six clusters by both methods

Cluster  Sample size  Average frequency  Average total spending  Average spending per visit  Average gender
K-means method
1        123          3.16               12,223.28               7,690.42                    1
2        76           2.04               10,718.03               7,295.52                    2
3        115          2.46               10,366.82               6,552.58                    1
4        99           2.13               7,005.01                4,576.88                    2
5        119          2.80               10,196.00               6,193.90                    1
6        19           2.74               9,743.37                3,923.98                    1.21
Spectral clustering technique
1        14           1.64               4,696.93                3,425.66                    1
2        172          2.09               8,601.07                5,700.73                    2
3        185          2.94               10,974.81               6,261.93                    1
4        165          2.64               8,422.56                5,092.37                    1
5        8            1.38               71,623.75               56,832.54                   1
6        7            4.29               9,138.43                5,260.73                    2
Total    551          2.54               10,167.26               6,385.97                    1.32

is defined as $\sqrt{\sum_{j=1}^{6} (n_{ij} - 1) s_{ij}^2 / \left(\sum_{j=1}^{6} n_{ij} - k\right)}$, where $i = 1, 2$ (two clustering methods), $j = 1, 2, 3, 4, 5, 6$ (six clusters for each method), and $k = 6$ (the number of clusters). In this case, the spectral clustering technique has a smaller average sample standard deviation than the K-means method, i.e., 1.172639 versus 1.184454. In summary, the spectral clustering technique outperforms the K-means method in both approaches to cluster quality assessment.
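The two quality measures can be sketched as follows. Several details are assumptions, since the text does not fix them: the inter-cluster distance is taken between cluster centroids, the diameter is taken as the largest pairwise distance within a cluster, and the second measure is read as a pooled sample standard deviation. The function names and the four-point toy data are hypothetical, and the sketch does not reproduce the figures reported for the outfitter data.

```python
from itertools import combinations
import numpy as np

def separation_ratio(points, labels):
    """min inter-cluster distance / max cluster diameter (larger is better)."""
    ids = np.unique(labels)
    centers = [points[labels == c].mean(axis=0) for c in ids]
    # Assumption: D_ij is the Euclidean distance between cluster centroids.
    min_between = min(np.linalg.norm(a - b) for a, b in combinations(centers, 2))

    def diameter(cluster):
        # Assumption: d_i is the largest pairwise distance inside the cluster.
        if len(cluster) < 2:
            return 0.0
        return max(np.linalg.norm(p - q) for p, q in combinations(cluster, 2))

    max_diameter = max(diameter(points[labels == c]) for c in ids)
    return float(min_between / max_diameter)

def pooled_sd(points, labels):
    """Pooled sample standard deviation over all clusters (smaller is better)."""
    ids = np.unique(labels)
    # (n_j - 1) * s_j**2 equals the sum of squared distances to the cluster mean.
    sum_sq = sum(
        np.sum((points[labels == c] - points[labels == c].mean(axis=0)) ** 2)
        for c in ids
    )
    return float(np.sqrt(sum_sq / (len(points) - len(ids))))

# Toy data (made up): two tight pairs of points, ten units apart.
toy_points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
toy_labels = np.array([0, 0, 1, 1])
ratio = separation_ratio(toy_points, toy_labels)
spread = pooled_sd(toy_points, toy_labels)
```

When two clusterings of the same data are compared, the one with the larger ratio and the smaller pooled spread would be preferred under these two criteria.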

5 Marketing strategies

According to Table 1, Cluster 5 has the highest average total spending and average spending per visit but the lowest shopping frequency, and can therefore be defined as big spenders (Thompson 2000). The average spending per visit of these eight customers is much larger than that of the whole data set. Given this very high spending, a plausible guess is that these members purchase in order to organize group outdoor activities. If this is true, the outfitter may sign long-term contracts with these members and try to become their regular equipment supplier. Table 1 also shows that the average shopping frequency of this cluster is quite low, below the average of the whole data set. To increase it, the outfitter may encourage Cluster 5 to purchase more frequently by cooperating with these members to plan outdoor activities regularly and, at the same time, providing special offers linked to these plans. The outfitter may also increase these customers' shopping frequency by offering coupons or bonus collection schemes.

The second important group in Table 1 is Cluster 3. This group is the largest of all, with 185 members. It shops frequently and has the second highest average total spending and average spending per visit, and is therefore the core customer or the best customer (Thompson 2000). The outfitter should keep in frequent contact with these members and maintain their attitudinal and behavioral loyalty. In addition, the outfitter should try to raise


Table 2 Average amount spending per visit for each cluster by both methods

Spending per visit (NT$)  Cluster 1  Cluster 2  Cluster 3  Cluster 4  Cluster 5  Cluster 6
K-means method
≤5,000                    74         41         70         68         82         13
5,001–10,000              22         16         19         17         15         5
10,001–15,000             8          12         11         9          9          0
15,001–20,000             9          3          6          2          2          1
20,001–25,000             3          2          3          3          5          0
25,001–30,000             1          1          3          0          2          0
30,001–35,000             2          0          1          0          1          0
35,001–40,000             1          0          1          0          0          0
≥40,001                   3          1          1          0          3          0
Total                     123        76         115        99         119        19
Spectral clustering technique
≤5,000                    10         108        115        111        0          4
5,001–10,000              4          32         31         25         0          2
10,001–15,000             0          21         14         14         0          0
15,001–20,000             0          4          10         8          0          1
20,001–25,000             0          5          8          3          0          0
25,001–30,000             0          1          2          3          1          0
30,001–35,000             0          0          3          1          0          0
35,001–40,000             0          0          2          0          0          0
≥40,001                   0          0          0          0          7          0
Total                     14         172        185        165        8          7

Table 3 Gender distribution for each cluster by both methods

Gender  Cluster 1  Cluster 2  Cluster 3  Cluster 4  Cluster 5  Cluster 6
K-means method
Male    123        0          115        0          119        15
Female  0          76         0          99         0          4
Total   123        76         115        99         119        19
Spectral clustering technique
Male    14         0          185        165        8          0
Female  0          172        0          0          0          7
Total   14         172        185        165        8          7

the average spending per visit through coupons, bundling and sales suggestions, because the majority of this group spend less than NT$5,000 per visit (Table 2), which is less than the figure for the whole data set (Table 1).

The third group of focus in Table 1 is Cluster 6. Members of this group are all female (Table 3) and live far away from where the store is located (Table 4). This group shops most frequently, but its average total spending and average spending per visit are below the average


Table 4 Distribution of customers from different regions in Taiwan by both methods

Region      Cluster 1  Cluster 2  Cluster 3  Cluster 4  Cluster 5  Cluster 6
K-means method
Northern    122        73         115        98         119        0
Central     1          3          0          1          0          3
Southern    0          0          0          0          0          11
Eastern     0          0          0          0          0          4
Islet area  0          0          0          0          0          1
Total       123        76         115        99         119        19
Spectral clustering technique
Northern    0          171        183        165        8          0
Central     2          1          2          0          0          3
Southern    8          0          0          0          0          3
Eastern     4          0          0          0          0          0
Islet area  0          0          0          0          0          1
Total       14         172        185        165        8          7

Table 5 Age group for each cluster by both methods

Age group                 Cluster 1  Cluster 2  Cluster 3  Cluster 4  Cluster 5  Cluster 6
K-means method
Group 1 (25 and below)    15         10         7          14         12         5
Group 2 (26–30)           28         22         31         26         18         7
Group 3 (31–35)           26         19         27         23         36         3
Group 4 (36–40)           28         7          16         19         25         1
Group 5 (41–45)           12         8          15         8          16         0
Group 6 (46–50)           8          7          10         4          6          2
Group 7 (51 and above)    6          3          9          5          6          1
Total                     123        76         115        99         119        19
Spectral clustering technique
Group 1 (25 and below)    4          24         18         13         3          1
Group 2 (26–30)           4          48         35         41         2          4
Group 3 (31–35)           2          39         52         36         2          2
Group 4 (36–40)           1          26         39         29         1          0
Group 5 (41–45)           0          16         22         20         0          0
Group 6 (46–50)           2          11         10         14         0          0
Group 7 (51 and above)    1          8          9          12         0          0
Total                     14         172        185        165        8          7

of the whole data set. They are frequent shoppers but do not spend a lot at the store. In addition, 5 of the 7 customers in this group are aged 30 or below (Table 5). Therefore, the customers in this group can be characterized as young students who do not have much money but like to visit Taipei City and the store for fun. The outfitter can look for products with a lower cost of sales but acceptable quality (Reinartz and Kumer 2002). Alternatively, the outfitter may encourage these customers to bring friends to shop in the store by providing discounts, or to become


Table 6 Distribution of the average total spending by both methods

Average total spending (NT$)  Cluster 1  Cluster 2  Cluster 3  Cluster 4  Cluster 5  Cluster 6
K-means method
≤5,000                        53         33         54         58         58         11
5,001–10,000                  20         13         17         19         25         5
10,001–15,000                 20         13         15         8          12         1
15,001–20,000                 8          7          13         7          4          0
20,001–25,000                 5          3          5          4          6          0
25,001–30,000                 3          3          5          1          5          0
30,001–35,000                 2          1          2          0          2          1
35,001–40,000                 3          1          2          0          1          0
≥40,001                       9          2          2          2          6          1
Total                         123        76         115        99         119        19
Spectral clustering technique
≤5,000                        8          90         82         85         0          4
5,001–10,000                  5          31         38         22         0          1
10,001–15,000                 1          21         23         24         0          0
15,001–20,000                 0          13         8          17         0          1
20,001–25,000                 0          7          9          7          0          0
25,001–30,000                 0          4          7          6          0          0
30,001–35,000                 0          1          4          2          0          1
35,001–40,000                 0          1          5          1          0          0
≥40,001                       0          4          9          1          8          0
Total                         14         172        185        165        8          7

friends of the store who positively influence other customers while they are in the store. Since these customers may have profit potential in the long run, another tactic is to retain them by providing discount cards.

The fourth group that needs special attention is Cluster 1. This group has the lowest average total spending and the lowest average spending per visit, and its shopping frequency is the second lowest after Cluster 5 (Table 1). This may be because these customers are new, quite young (Table 5), have to travel far to shop at the store (Table 4), or shop at the store purely out of curiosity or by chance. The outfitter should investigate whether these customers are new or one-time shoppers. If they are one-time shoppers, the outfitter should ignore them because they are not profitable (Reinartz and Kumer 2002; Thompson 2000).

The performance of the last two groups, namely Cluster 2 and Cluster 4, is mediocre in terms of shopping frequency, average total spending and average spending per visit. The outfitter may use coupons to attract these customers both to shop more frequently and to spend more in the store (Table 6).

6 Conclusions

This study uses two clustering techniques to group an outfitter's customers. The K-means method is better suited to linearly separable input, while the spectral clustering technique may have the advantage on input that is not linearly separable. The data set in the case study comes from an outfitter in Taipei City, Taiwan. The cluster quality assessment shows that the spectral clustering technique outperforms the K-means method. Marketing strategies are then suggested based on the results of the spectral clustering technique. In summary, Cluster 5 can be defined as big spenders, and the outfitter should focus on increasing shopping frequency and prolonging the relationship with these customers. Cluster 3 is considered to be the core customer or the best customer. The outfitter should try to retain this group's loyalty and to increase this group's spending per visit.

References
Chang, E.-C., Huang, S.-C., Wu, H.-H., Lo, C.-F.: A case study of applying spectral clustering technique in the value analysis of an outfitter's customer database. In: 2007 IEEE International Conference on Industrial Engineering and Engineering Management, pp. 1743–1746. Singapore (2007)
Dhillon, I.S., Guan, Y., Kulis, B.: Kernel k-means, spectral clustering and normalized cuts. In: Proceedings of Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'04), pp. 551–556. Seattle, Washington, USA (2004)
Draghici, S.: Data Analysis Tools for DNA Microarrays. Chapman & Hall/CRC, New York (2003)
Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: a review. ACM Comput. Surv. 31(3), 264–323 (1999)
Joia, L.A., Sanz, P.S.: The financial potential of sporadic customers in e-retailing: evidence from the Brazilian home appliance sector. J. Electron. Commer. Organiz. 4(1), 18–33 (2006)
Kumer, V., Reinartz, W.: Customer Relationship Management: A Database Approach. Wiley, Hoboken, NJ (2006)
Kuo, R.-J., Ho, L.M., Hu, C.M.: Integration of self-organizing feature map and K-means algorithm for market segmentation. Comput. Oper. Res. 29(11), 1475–1493 (2002)
Reinartz, W., Kumer, V.: The mismanagement of customer loyalty. Harv. Bus. Rev. 80(7), 86–94 (2002)
Sheth, J.N., Sisodia, R.S., Sharma, A.: The antecedents and consequences of customer-centric marketing. J. Acad. Market. Sci. 28(1), 55–66 (2000)
Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 888–905 (2000)
Speer, N., Spieth, C., Zell, A.: Spectral clustering gene ontology terms to group genes by function. In: Casadio, R., Myers, G. (eds.) Lecture Notes in Bioinformatics, vol. 3692, pp. 1–12. Springer, Berlin (2005)
Stone, B.: Successful Direct Marketing Methods, 3rd edn. NTC Publishing, Lincolnwood, IL (1984)
Tao, X., Michel, H.E.: Classification of multispectral satellite image data using improved NRBF neural networks. In: Proceedings of SPIE, International Society for Optical Engineering, vol. 5267, pp. 311–320 (2003)
Thompson, H.: The Customer-Centered Enterprise. McGraw-Hill, New York (2000)
Viaene, S., Baesens, B., den Poel, D., Dedene, G., Vanthienen, J.: Wrapped input selection using multilayer perceptrons for repeat-purchase modeling in direct marketing. Int. J. Intell. Syst. Account. Financ. Manag. 10(2), 115–127 (2001)
Wu, H.-H., Lin, S.-Y., Liao, A.Y.H., Shieh, J.-I.: An application of the generalised K-means algorithm in decision-making processes. Int. J. Oper. Res. 3(1/2), 19–35 (2008)
Yang, X.: How to develop new approaches to RFM segmentation. J. Target. Meas. Anal. Mark. 13(1), 50–61 (2004)
Yoon, K., Hwang, C.L.: Multiple Attribute Decision Making: An Introduction. Sage Publication, California (1995)
