
Applications of Clustering Data Mining in Customer Analysis in Department Store
Wencai Liu, Yu Luo
College of Automation, Chongqing University, Chongqing 400044, China
liuwc@cqu.edu.cn, luoyu0521@sina.com.cn
Abstract- Data mining is used to discover knowledge in large
volumes of data in order to support decision making. This paper
studies the implementation aspects of applying clustering data
mining to customer analysis in a department store. A data
warehouse is built on the OLTP database of Chongqing Liangbai
Department Store. Two data mining models are applied to analyze
customer characteristics and the relationship between customers
and product categories. The mining results are analyzed.
Keywords: data warehouse; data mining; clustering analysis

I. INTRODUCTION
Many retail enterprises have accumulated huge amounts of data
from their OLTP (On-Line Transaction Processing) systems, some
of which have been in use for many years. How to use these data
to support managers' decisions is receiving more and more
attention.
Data Warehouse, On-Line Analysis Processing (OLAP) and Data
Mining technologies have been developed to help solve data
analysis problems. A Data Warehouse stores and manages huge
volumes of data in such a way that the data can be easily
analyzed. OLAP provides an efficient way to query data along
various dimensions and levels. Data Mining can discover
valuable, previously unknown knowledge, such as relationships,
rules and patterns.
Data Mining uses complex algorithms to analyze data and
create models to represent information in the data. These
models can predict the characteristics of new data or
recognize the data entities that have similar characteristics.
There are many data mining types, such as association rules,
decision trees, clustering, neural networks, etc.
The applications of data mining in retail enterprises are mainly
concentrated on association rules. This kind of data mining can
discover rules of the form "customers who buy product A will
probably also buy product B" [5]. As various kinds of customer
cards, which provide information about customers, have been
adopted by many retail enterprises, customer analysis using data
mining technologies has become realistic.
This paper discusses applications of clustering data mining to
customer analysis in department stores. The clustering
algorithms are briefly summarized first. Then the data warehouse
based on the OLTP (On-Line Transaction Processing) database of
Chongqing Liangbai Department Store (CLDS) is introduced.
Finally, the customer characteristics and the relationship
between customers and product categories are analyzed using the
Microsoft clustering algorithm.

II. CLUSTERING ALGORITHMS
Clustering data mining automatically divides data objects into
several groups, or clusters, so that objects in the same group
are highly similar and objects in different groups differ
greatly. Clustering algorithms have been studied for many years.
The typical algorithms are k-means and k-medoids.
In k-means, every group is represented by the average value of
the objects in the group. The clustering process is as follows.
Randomly choose k objects, each of which initially represents
the average of one group, and assign each remaining object to
the group whose average is nearest to it. Then calculate the
average of each group and reassign all objects according to
their distances to the new averages. Repeat the averaging and
reassignment steps until the groups no longer change. In
k-medoids, each group is represented by the object that is
nearest to the center of the group. The clustering process is
similar to that of k-means.
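
The k-means steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this paper: the function names, the squared Euclidean distance, the fixed random seed and the iteration cap are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(column) / n for column in zip(*points))


def k_means(points, k, seed=0, max_iter=100):
    """Return (centers, assignment) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Randomly choose k objects as the initial group averages.
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assign every object to the group with the nearest average.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Stop when the groups no longer change.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Recalculate the average of each group.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old center if a group became empty
                centers[j] = mean(members)
    return centers, assignment
```

Every pass of the loop reads all points again, which is the repeated scanning of the instance data that the scalable algorithm discussed next is designed to avoid.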
The above algorithms need to scan the instance data many times.
Microsoft provides an efficient algorithm, Scalable Expectation
Maximization (SEM) [2][3], which scans the data only once. The
basic idea is to create clusters according to the density of the
instance objects. The calculation can be stopped at any point
and restarted later, and a reasonable result is available at any
point in the process. The algorithm creates some clusters while
processing data records and adjusts the centers of the clusters
as more data are processed, in order to find the cluster set
that best describes the characteristics of similar instance
objects.
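
The single-pass behaviour can be illustrated with a much simpler technique than SEM. The sketch below is sequential (online) k-means, swapped in purely for illustration: it is not the Microsoft algorithm and does not model densities. It only shows how cluster centers can be adjusted record by record, so that the data are read once and a usable result exists at any point. The class name, the seeding from the first k records and the running-mean update are assumptions.

```python
class OnlineClusterer:
    """Sequential k-means: each record is seen exactly once."""

    def __init__(self, k):
        self.k = k
        self.centers = []  # current cluster centers (running means)
        self.counts = []   # number of records absorbed by each center

    def update(self, record):
        """Absorb one record and return the index of its cluster."""
        if len(self.centers) < self.k:
            # Assumption: the first k records seed the clusters.
            self.centers.append(list(record))
            self.counts.append(1)
            return len(self.centers) - 1
        # Find the nearest existing center.
        nearest = min(
            range(self.k),
            key=lambda j: sum(
                (x - c) ** 2 for x, c in zip(record, self.centers[j])
            ),
        )
        # Move that center toward the record (running-mean update).
        self.counts[nearest] += 1
        n = self.counts[nearest]
        self.centers[nearest] = [
            c + (x - c) / n for x, c in zip(record, self.centers[nearest])
        ]
        return nearest
```

Processing can stop after any call to `update`, and `centers` then holds the current result; feeding further records later simply continues the adjustment.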

III. CREATING DATA WAREHOUSE


A Data Warehouse is a subject-oriented, integrated,
non-volatile, and time-variant collection of data in support of
management's decision-making process [4]. The database of a data
warehouse consists of fact tables and dimension tables that are
structured as star schemas or snowflake schemas. The data in a
data warehouse come from the OLTP database and other data
sources of the enterprise through extraction, organization and
transformation. The data source used in this paper is the OLTP
database of Chongqing Liangbai Department Store.

0-7803-8971-9/05/$20.00 ©2005 IEEE

A. About the data source


The management information system of Chongqing Liangbai
Department Store was installed in 1998. Since August 2002,
customer cards have been used to accumulate the amount of each
customer's purchases, and since then some customer information
has been collected.
The structure of the part of the OLTP database related to sales
analysis is shown in Fig. 1, which is a conceptual data model
(CDM) designed with PowerDesigner [6]. The Sales Transaction
entity contains the data of each transaction (time, amount of
money, type of payment, customer integration, etc.). The Sales
Product entity contains the data of the products sold in each
transaction (quantity, price, discount rate, amount of money,
etc.). The Inventory entity contains the account data for every
batch of store purchases (batch number, in-out time, in-out
type, quantity, prime cost, amount of money). These three
entities are the data sources of the fact tables of the data
warehouse.
The data from the POS (point of sale) terminals contain no batch
information for products. The system accumulates the sales
quantity of every product by day and then decomposes it into
product batches by the first-in, first-out rule. This process
adds batch information to the sold products, so the sold
products have prime cost information and we can do profit
analyses.
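
The first-in, first-out decomposition can be sketched as follows. This is an illustration under stated assumptions, not the store system's code: the batch layout `[batch_number, remaining_quantity, unit_prime_cost]` and the function names are hypothetical, and prime cost is assumed to be recorded per unit.

```python
from collections import deque


def fifo_allocate(batches, quantity_sold):
    """Split a day's sold quantity of one product across stock batches.

    `batches` is a deque of [batch_number, remaining_quantity,
    unit_prime_cost] lists, oldest batch first; it is updated in place.
    Returns a list of (batch_number, quantity_taken, unit_prime_cost).
    """
    allocations = []
    remaining = quantity_sold
    while remaining > 0:
        if not batches:
            raise ValueError("sold quantity exceeds stock on hand")
        batch = batches[0]
        take = min(remaining, batch[1])
        if take > 0:
            allocations.append((batch[0], take, batch[2]))
        batch[1] -= take
        remaining -= take
        if batch[1] == 0:
            batches.popleft()  # the oldest batch is used up
    return allocations


def prime_cost(allocations):
    """Total prime cost of the allocated quantities."""
    return sum(quantity * cost for _, quantity, cost in allocations)
```

Subtracting the value returned by `prime_cost` from the sales amount of the same quantity gives the gross profit used in the profit analyses.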
There are three types of sales counters in the store: normal,
joint and rent. The OLTP database has no product information for
joint and rent counters. Each joint or rent counter uses only
one product code and uses the counter's name as the product
name. Therefore we cannot do product analyses for products sold
at these kinds of counters.

Fig. 1. Part of the E-R model of the CLDS MIS (PowerDesigner CDM)
[Figure not reproduced. Legible entities and attributes:
customers (customer-integration, customer-name, ID-card-number,
gender, birth-year-month, phone-number, mail-address,
education-level, occupation); products (product-name,
product-specification, product-bar-code, unit); product sub
category; sales-transaction (transaction ID, sale-date,
sale-time, cashier, cash-payment, bank-card-payment,
check-payment, store-card-payment, expenses-integration).]

B. The dimension tables and fact tables


For the clustering analyses of this paper, only two dimension
tables are needed: customer and product. Taking the requirements
of the whole sales analysis into account, there should also be a
time dimension table and a counter dimension table. Using a star
structure, we design the data warehouse for sales analyses as
shown in Fig. 2, which is a PowerDesigner physical data model
(PDM) [6]. All dimension tables are shared dimensions.

Because only part of the sales records have customer
information, there should be a special record in the customer
dimension table (here we use card number 1999999999) that
represents unknown customers, to ensure referential integrity.
The customer dimension table has an age segment field for
analyzing the age characteristics of customers. Because age
varies with time, we should construct a sales fact table, called
the year sales fact table, which holds only one year's sales
data, and set the values of the age segment field in the
customer dimension table correspondingly.
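
The two conventions in this paragraph can be sketched as small helper functions. Only the card number 1999999999 comes from this paper. The function names, the five-year band width, the open-ended top band and the label format are assumptions chosen for the example, since the segment boundaries are not stated here.

```python
UNKNOWN_CARD_NUMBER = "1999999999"  # special record for unknown customers


def card_number_for_fact(card_number):
    """Map a sales record without a card to the unknown-customer record,
    so every fact row still references a row of the customer dimension."""
    return card_number if card_number else UNKNOWN_CARD_NUMBER


def age_segment(birth_year, fact_year, width=5, top=50):
    """Age segment of a customer for the year covered by the fact table.

    Assumed labels: 'A30-35' for ages 30 to 34, and '50+' from `top` up.
    """
    age = fact_year - birth_year
    if age >= top:
        return f"{top}+"
    low = (age // width) * width
    return f"A{low}-{low + width}"
```

Because the segment depends on `fact_year`, the field has to be recomputed whenever a fact table for a different year is built, which is why the year sales fact table holds a single year of data.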

The sales fact table in Fig. 2 has all sales records of all
counters (including joint and rent counters), but only the
records of normal counters have actual product information.
Therefore, for the analyses related to products, we should
construct another sales fact table, called the normal counter
sales fact table, from which the sales records of joint and rent
counters have been deleted.
The customer information available from the OLTP data is
customer name, age and gender. An ID card is required when a
customer applies for a customer card, so this information is
available for all customers. Most customers do not fill
additional information, such as education level and profession,
into the application forms, so we cannot analyze other
characteristics of customers. To analyze customer
characteristics we need the year sales fact table, and to
analyze the relationships between customer characteristics and
product categories we need the normal counter sales fact table.
At the same time, the two fact tables should not contain records
that have no actual customer information.

[Fig. 2, customers dimension table] customer-ID int <pk>;
customer-name char varying(20); customer-number char(10);
gender char(2); age-segment char(6); education level char(6);
occupation char varying(10); mail address char varying(30);
phone-number char(16)

[Fig. 2, products dimension table] product-ID int <pk>;
product-code char(10); product-name char varying(30);
product-specification char varying(30); sub-category
char varying(10); category char varying(10); super-category
char varying(10)
[Fig. 2, sales-fact table] customer-ID int <fk1>; product-ID
int <fk2>; date-ID int <fk3>; counter-ID int <fk4>;
quantity_of_sale decimal; amount-of-money money; gross-profit
money
[Fig. 2, date dimension table] date-ID int <pk>; date char(10);
day char(9); month char(9); season char(2); year char(4)

[Fig. 2, counters dimension table] counter-ID int <pk>;
counter-code char(4); counter-name char varying(10);
counter-type char(4); department-name char varying(10)

C. Data extraction and loading
We extract data from the OLTP database and load them into the
dimension tables and the fact table of the data warehouse shown
in Fig. 2. The dimension tables should be loaded first, and then
the fact table. From Fig. 1 and Fig. 2 we can see that it is
simple to get data for the dimension tables but complex for the
fact table. The customer dimension table gets its data directly
from the customer table of the OLTP database, and a special
record with card number '1999999999' is inserted. The counter
dimension table gets its data from a join of the department
table and the counter table of the OLTP database. The product
dimension table is loaded with data retrieved from a join of the
super category, category, subcategory and product tables of the
OLTP database. A surrogate key is set for each dimension table;
its value is assigned automatically by the DBMS when a record is
inserted.
Sales fact data are extracted as follows. First, compute the
average prime cost of each product sold on each day from the
inventory table and store the averages in a temporary table.
Second, retrieve the sales quantity, amount of money and prime
cost from a join of the sales product table and the prime cost
temporary table, and calculate the gross profit for each product
sold. In the second step, all the dimension tables should also
be joined into the query to obtain the foreign key values of the
sales fact table.
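
The two extraction steps can be sketched with Python's built-in sqlite3 module. This is an illustration only: SQLite stands in for the actual DBMS, every table and column name is hypothetical, the sample rows are invented, prime cost is assumed to be a unit cost, and the average is assumed to be weighted by quantity. Only the product dimension is joined, to keep the example short.

```python
import sqlite3


def build_sales_facts():
    """Return (product_id, sale_date, quantity, amount, gross_profit) rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Dimension table with an automatically assigned surrogate key.
        CREATE TABLE products_dim (
            product_id INTEGER PRIMARY KEY AUTOINCREMENT,
            product_code TEXT
        );
        CREATE TABLE inventory (product_code TEXT, in_out_date TEXT,
                                quantity REAL, prime_cost REAL);
        CREATE TABLE sales_product (product_code TEXT, sale_date TEXT,
                                    quantity REAL, amount REAL);

        -- Invented sample data.
        INSERT INTO products_dim (product_code) VALUES ('P1');
        INSERT INTO inventory VALUES ('P1', '2003-05-01', 2, 10.0),
                                     ('P1', '2003-05-01', 2, 14.0);
        INSERT INTO sales_product VALUES ('P1', '2003-05-01', 3, 60.0);

        -- Step 1: average prime cost per product and day, kept in a
        -- temporary table.
        CREATE TEMP TABLE avg_prime_cost AS
        SELECT product_code, in_out_date AS cost_date,
               SUM(quantity * prime_cost) / SUM(quantity) AS unit_cost
        FROM inventory
        GROUP BY product_code, in_out_date;
    """)
    # Step 2: join sales with the cost table and the dimension table,
    # computing gross profit and picking up the surrogate key.
    rows = conn.execute("""
        SELECT d.product_id, s.sale_date, s.quantity, s.amount,
               s.amount - s.quantity * c.unit_cost AS gross_profit
        FROM sales_product AS s
        JOIN avg_prime_cost AS c
          ON c.product_code = s.product_code
         AND c.cost_date = s.sale_date
        JOIN products_dim AS d ON d.product_code = s.product_code
    """).fetchall()
    conn.close()
    return rows
```

A full load would join the customer, date and counter dimension tables in the same query so that each fact row carries all four foreign keys.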

IV. DATA MINING MODELS AND RESULTS
A. Clustering of customer characteristics
A query of the 2003 sales data shows that only 15% of the
records have customer information. Although the percentage is
small, these records are still representative to a certain
degree. The 2003 sales data are used in this paper for customer
analysis. Therefore we need a new sales fact table, named
2003_sales, which is loaded from the sales fact table only with
the 2003 data that have customer information. At the same time,
the values of the age segment field of the customer dimension
table should be assigned according to each customer's birth
year.
To cluster customer instances, the sales facts need to be
aggregated by customer. We can avoid this step by clustering on
an OLAP multidimensional data set, in which the gender and age
segment fields are set as two properties of the customer
(identified by card number). We create the data mining model in
Microsoft Analysis Manager [2] as follows:
Data type: OLAP; Multidimensional data set: 2003_sales; Data
mining algorithm: Microsoft clustering; Dimension: customer;
Level: customer card number; Training data: customer age
segment, customer gender, sales amount of money; Number of
clusters: 10.
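
For readers without an OLAP data set, the aggregation that the cube performs implicitly can be written by hand. The sketch below is an assumption-laden illustration: the input layout (pairs of card number and amount, and a dictionary of customer properties) and all names are hypothetical.

```python
from collections import defaultdict


def aggregate_by_customer(sales_facts, customers):
    """Build one training instance per customer.

    sales_facts: iterable of (card_number, amount_of_money) pairs.
    customers: dict mapping card_number to a dict with the keys
               'age_segment' and 'gender'.
    """
    totals = defaultdict(float)
    for card_number, amount in sales_facts:
        totals[card_number] += amount  # total sales amount per customer
    return [
        {
            "card_number": card_number,
            "age_segment": customers[card_number]["age_segment"],
            "gender": customers[card_number]["gender"],
            "sales_amount": total,
        }
        for card_number, total in sorted(totals.items())
    ]
```

Each resulting dictionary corresponds to one instance at the customer card number level, carrying the three training attributes listed in the model definition.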
The mining results show that, of the 6485 instances, female
customers account for 75% and customers above the age of 35
account for 62%; the average expense per customer is 537 Yuan
RMB. The main characteristics of the customers in the 10
clusters are shown in Tab. 1.
TABLE 1
MAIN CHARACTERISTICS OF CUSTOMERS
[Table body not recoverable from the source. Column headers, as
far as legible: cluster; amount of instances; percentage of
instances; main age segments; main gender; average expenses.]

We can see from the mining results that the main customers of
the store are women above 35 years old; that the fewer the
instances in a cluster, the higher the average expense per
customer; and that male customers between 30 and 45 years old
have very high expenses (the 10th cluster). These analysis
results can help managers improve their management tactics, for
example by stocking more products that satisfy the needs of the
main customer group in order to increase their expenses, or by
designing a golden card to increase the proportion of
high-expense customers.

B. Clustering of customer characteristics and products
To investigate further the relationship between customer
characteristics and products, we cluster customers together with
product categories. For this purpose we create a new sales fact
table, named 2003_sales_normal, and load it from 2003_sales with
the 2003 data of normal counters that have customer information.
The mining model is set as follows:
Data type: relational data; Fact table: 2003_sales_normal;
Dimension tables: customers, products; Mining algorithm:
Microsoft clustering; Key of instance: product ID; Training
data: customer age segment, customer gender, super category of
products, sales amount of money; Number of clusters: 10.

TABLE 2
MAIN CHARACTERISTICS OF CUSTOMERS AND PRODUCT CATEGORIES
[Table body not recoverable from the source. Column headers, as
far as legible: cluster; instance proportion; customer main age
segments; main gender; main product categories. Legible
product-category entries include daily use articles, cosmetics,
spare time foods, candies & cakes, drinking powders, grains &
oils, and alcohol.]


The mining results show that, of the 59051 instances, female
customers account for 69% and customers above the age of 35
account for 60%. This result indicates that the main customer
group for normal counters is the same as the main customer group
of the store. The purchases of the main customer group are
mainly concentrated in 5 product categories: daily use articles,
cosmetics, spare time foods, dry foods, and candies & cakes,
which are the product categories most frequently bought by
customers. The instances of these categories account for 71% of
the total sales instances, while these categories make up 28% of
the 18 product categories in the store. This result
approximately fits the "two-eight rule".
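
The category percentage follows directly from the counts given in the text, as the short check below shows. The 71% instance share is taken from the reported results, not recomputed, and the variable names are chosen for the example.

```python
def share(part, whole):
    """Fraction of `whole` represented by `part`."""
    return part / whole


# 5 of the 18 product categories are the frequently bought ones.
category_share = share(5, 18)

# Share of sales instances in those categories, as reported in the text.
instance_share = 0.71

print(f"{category_share:.0%} of categories hold {instance_share:.0%} of instances")
```

So about 28% of the categories account for 71% of the sales instances, which is close to, but less extreme than, a strict 20/80 split.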
Inspecting the 10 clusters, we get the main characteristics of
each cluster, shown in Tab. 2, which represent the relationships
between customer characteristics and product categories. These
relationships can be used to make advertising, sales promotions,
and customer receptions more specific and more effective. The
results can also be used to direct OLAP queries to obtain more
valuable decision support information.

V. CONCLUSION
Clustering data mining can be used to group instance data
automatically according to their similarities in order to find
useful patterns contained in the data. Enterprise managers can
get decision support by querying the mining results. Clustering
analysis can provide an approximate understanding of a problem
and can usually point out other areas that need to be
investigated. Clustering is usually the first step of data
analysis.
Customer analyses are clearly very important for retail
enterprises. This paper provides two application examples of
clustering in customer analysis in a department store and gives
the whole process of realizing the analyses. These examples show
that clustering data mining is useful and effective. Retail
enterprises that have customer information can use clustering
data mining techniques, as in the examples provided in this
paper, to get valuable information for decision making.
With Microsoft Analysis Services, the data mining itself is
simple. The main work in doing the mining is building the data
warehouse, which is case specific and needs domain knowledge and
information system skills. Because only partial customer
information is available, the results of the clustering analyses
in this paper have certain limitations for decision support.

REFERENCES
[1] Jiawei Han and Micheline Kamber, Data Mining: Concepts and
Techniques. San Francisco: Morgan Kaufmann Publishers, Inc., 2001.
[2] Tony Bain et al., Professional SQL Server 2000 Data Warehousing
with Analysis Services. Wrox Press, 2001.
[3] Paul S. Bradley, Usama M. Fayyad and Cory A. Reina. (October
1999). Scaling EM (Expectation Maximization) Clustering to Large
Databases. Available:
f t p : / l f t p . ~ s ~ ~ h . ~ ~ ~ ~ ~ . ~ ~ ~ ~ ~ b / t r / t
[4] W. H. Inmon, Building the Data Warehouse. NJ: John Wiley & Sons,
Inc., 1996.
[5] R. Agrawal, T. Imielinski and A. Swami, Mining Association Rules
Between Sets of Items in Large Databases, Proc. ACM SIGMOD, 1993:
207-216.
[6] Shangwang Bai and Weichao Dang, PowerDesigner Software
Engineering Technology. Beijing: Publishing House of Electronics
Industry, 2004.
