in Department Store
Wencai Liu, Yu Luo
College of Automation, Chongqing University, Chongqing 400044, China
liuwc@cqu.edu.cn, luoyu0521@sina.com.cn
Abstract- Data mining is used to discover knowledge in large
volumes of data in order to support decision making. This
paper studies the implementation aspects of applying
clustering data mining to customer analysis in a department
store. A data warehouse is built on the OLTP database of
Chongqing Liangbai Department Store. Two data mining
models are applied to analyze customer characteristics and
the relationship between customers and product categories.
The mining results are analyzed.
Keywords: data warehouse; data mining; clustering
analysis
I. INTRODUCTION
Many retail trade enterprises have accumulated huge
amounts of data in their OLTP (On-Line Transaction
Processing) systems, which may have been in use for many
years. How to use these data to provide decision support for
managers is receiving more and more attention.
Data Warehouse, On-Line Analytical Processing (OLAP)
and Data Mining technologies have been developed to help
solve data analysis problems. A Data Warehouse stores
and manages huge volumes of data in such a way that the data
can be easily analyzed. OLAP provides an efficient way to
query data along various dimensions and levels. Data Mining
can discover valuable, previously unknown knowledge, such as
relationships, rules and patterns.
Data Mining uses complex algorithms to analyze data and
create models to represent information in the data. These
models can predict the characteristics of new data or
recognize the data entities that have similar characteristics.
There are many data mining types, such as association rules,
decision trees, clustering, neural networks, etc.
The applications of data mining in retail trade enterprises
are mainly concentrated on association rules. This kind of
data mining discovers rules of the form "customers who buy
product A will probably also buy product B" [5]. As
various kinds of customer cards, which provide
information about customers, have been adopted by many
retail trade enterprises, customer analysis using data
mining technologies has become practical.
This paper discusses applications of clustering data mining
to customer analysis in department stores. The clustering
algorithms are briefly summarized first, then the data
warehouse based on the OLTP (On-Line Transaction
Processing) database of Chongqing Liangbai Department
II. CLUSTERING ALGORITHMS
Clustering data mining automatically divides data objects
into several groups, or clusters, so that objects in the
same group are highly similar and objects in different
groups differ strongly. Clustering algorithms have been
studied for many years. The typical algorithms are k-means
and k-medoids.
In k-means, every group is represented by the average value
of the objects in the group. The clustering process is as
follows. Randomly choose k objects, each of which initially
represents the average of one group. Assign each remaining
object to the group whose average is nearest to it.
Calculate the average of each group, and regroup all objects
according to their distances to the new averages. Repeat the
averaging and regrouping steps until the groups no longer
change. In k-medoids, each group is represented by the
object that lies at the center of the group. The clustering
process is similar to that of k-means.
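The k-means steps described above can be sketched in a few lines of Python. This is a minimal illustration only, assuming numeric feature tuples and squared Euclidean distance; the function names, the seed parameter and the sample points are hypothetical and do not come from the paper.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise average of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, seed=0, max_iter=100):
    """Return (centers, assignment) after the groups stop changing."""
    rng = random.Random(seed)
    # Step 1: k randomly chosen objects act as the initial averages.
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign every object to the nearest average.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Stop when regrouping no longer changes anything.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute the average of each non-empty group.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centers[j] = mean(members)
    return centers, assignment


sample_points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
sample_centers, sample_assignment = k_means(sample_points, k=2)
print(sample_assignment)
```

Every pass of the loop rescans all objects, which is the repeated scanning that becomes costly on large sales tables.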
The algorithms above need to scan the instance data many
times. Microsoft® provides an efficient algorithm, Scalable
Expectation Maximization (SEM) [2][3], which scans the
data only once. The basic idea is to create clusters
according to the density of the instance objects. The
calculation can be stopped at any point and restarted later,
and a reasonable result is available at any point in the
process. The algorithm creates some clusters while
processing data records, and adjusts the centers of the
clusters as more data are processed, in order to find the
cluster set that best describes the characteristics of
similar instance objects.
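The single-scan idea can be illustrated with a much simpler technique than SEM. The sketch below is not the SEM algorithm of [3]; it is sequential (online) k-means, in which each record is read once and moves its nearest center by a running-mean update. Seeding the centers with the first k records and the class name are assumptions made for this illustration.

```python
class OnePassClusters:
    """Sequential k-means: each record is seen once and then discarded."""

    def __init__(self, k):
        self.k = k
        self.centers = []  # current cluster centers
        self.counts = []   # number of records absorbed by each center

    def update(self, record):
        """Absorb one record; the model is usable after every call."""
        if len(self.centers) < self.k:
            # Assumption: the first k records seed the clusters.
            self.centers.append([float(v) for v in record])
            self.counts.append(1)
            return
        # Find the nearest center by squared Euclidean distance.
        j = min(
            range(self.k),
            key=lambda i: sum(
                (c - v) ** 2 for c, v in zip(self.centers[i], record)
            ),
        )
        # Move that center toward the record (running mean).
        self.counts[j] += 1
        rate = 1.0 / self.counts[j]
        self.centers[j] = [
            c + rate * (v - c) for c, v in zip(self.centers[j], record)
        ]


model = OnePassClusters(k=2)
for rec in [(0, 0), (10, 10), (0, 2), (10, 12)]:
    model.update(rec)  # processing could be paused and resumed here
print(model.centers, model.counts)
```

Because the state consists only of the centers and their counts, the scan can stop after any record and resume later, which mirrors the stop-and-restart property described above.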
Department Store.
three entities above are the data sources of the fact tables of
the data warehouse.
The data from the POS (point of sale) terminals contain no
batch information for products. The system accumulates
the sales quantity of every product by day, and then
decomposes it into product batches by the first-in,
first-out rule. This process adds batch information to the
sold products, so the products have prime cost
information and profit analyses become possible.
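The first-in, first-out decomposition described above can be sketched as follows. This is a hedged illustration: the batch layout (remaining quantity and unit prime cost, oldest batch first), the function name and the sample figures are assumptions, not the store's actual procedure.

```python
from collections import deque


def fifo_allocate(batches, quantity_sold):
    """Split a day's sold quantity over purchase batches, oldest first.

    batches: deque of [remaining_quantity, unit_cost], oldest first.
    The deque is updated in place. Returns a list of
    (quantity_taken, unit_cost) pairs.
    """
    allocations = []
    remaining = quantity_sold
    while remaining > 0:
        if not batches:
            raise ValueError("sold quantity exceeds the stock in all batches")
        batch = batches[0]
        taken = min(batch[0], remaining)
        allocations.append((taken, batch[1]))
        batch[0] -= taken
        remaining -= taken
        if batch[0] == 0:
            batches.popleft()  # batch exhausted, move to the next one
    return allocations


# Hypothetical stock: 10 units bought at 5 Yuan, then 20 units at 6 Yuan.
stock = deque([[10, 5], [20, 6]])
allocation = fifo_allocate(stock, 15)
prime_cost = sum(qty * cost for qty, cost in allocation)
gross_profit = 15 * 8 - prime_cost  # assuming a sale price of 8 Yuan
print(allocation, prime_cost, gross_profit)
```

Once each sold quantity carries a unit cost, gross profit is simply the sales amount minus the allocated prime cost.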
There are three types of sales counters in the store: normal,
joint and rent. The OLTP database holds no product
information for joint and rent counters. Each joint or rent
counter uses only one product code, and uses the counter's
name as the product name. Therefore we cannot do product
analyses for products sold at these kinds of counters.
[Figure residue: entity diagram of the OLTP database, apparently Fig. 1. Legible entities and attributes:
products: product-name, product-specification, product-bar-code, unit
product sub category
customers: customer-integration, customer-name, ID-card-number, gender, birth-year-month, phone-number, mail-address, education-level, occupation
sales-transaction: transaction ID, sale-date, sale-time, cashier, cash-payment, bank-card-payment, check-payment, store-card-payment, expenses-integration]
[Figure residue: data warehouse schema, apparently Fig. 2. Column types are paired with column names in the order in which they appear in the source.
customers: customer-ID int <pk>, customer-name char varying(20), customer-number char(10), gender char(2), age-segment char(6), education level char(6), occupation char varying(10), mail address char varying(30), phone-number char(16)
products: product-ID int <pk>, product-code char(10), product-name char varying(30), product-specification char varying(30), sub-category char varying(10), category char varying(10), super-category char varying(10)
sales-fact: customer-ID int <fk1>, product-ID int <fk2>, date-ID int <fk3>, counter-ID int <fk4>, quantity_of_sale decimal, amount-of-money money, gross-profit money
date dimension (table name not legible): date-ID int <pk>, date char(10), day char(9), month char(9), season char(2), year char(4)]
C. Data abstracting and loading
We extract data from the OLTP database and load it into
the dimension tables and the fact table of the data warehouse
shown in Fig. 2. The dimension tables should be loaded first,
and then the fact table. From Fig. 1 and Fig. 2 we can see
that it is simple to get data for the dimension tables, but
complex for the fact table. The customer dimension table gets
its data directly from the customer table of the OLTP
database, and a special record with card number '1999999999'
is inserted. The counter dimension table is loaded with data
joined from the department table and the counter table of the
OLTP database. The product dimension table is loaded with
data joined from the super category, category, subcategory
and product tables of the OLTP database. A surrogate key is
set for each dimension table. The value of the key is assigned
automatically by the DBMS when a record is inserted.
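Loading the product dimension from the four joined OLTP tables can be sketched with an in-memory SQLite database. This is a minimal sketch under stated assumptions: the source table layouts, column names and sample rows are hypothetical, and SQLite's AUTOINCREMENT stands in for whatever automatic key mechanism the store's DBMS provides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical OLTP source tables (layout assumed for illustration).
    CREATE TABLE super_category (code TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE category (code TEXT PRIMARY KEY, name TEXT, super_code TEXT);
    CREATE TABLE subcategory (code TEXT PRIMARY KEY, name TEXT,
                              category_code TEXT);
    CREATE TABLE product (code TEXT PRIMARY KEY, name TEXT,
                          subcategory_code TEXT);

    -- Product dimension: the key is assigned by the DBMS on insert.
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY AUTOINCREMENT,
        product_code TEXT,
        product_name TEXT,
        sub_category TEXT,
        category TEXT,
        super_category TEXT
    );

    INSERT INTO super_category VALUES ('S1', 'Cosmetics');
    INSERT INTO category VALUES ('C1', 'Skin care', 'S1');
    INSERT INTO subcategory VALUES ('U1', 'Face cream', 'C1');
    INSERT INTO product VALUES ('P001', 'Cream A', 'U1');
    INSERT INTO product VALUES ('P002', 'Cream B', 'U1');
""")

# Flatten the category hierarchy into one dimension row per product.
conn.execute("""
    INSERT INTO dim_product
        (product_code, product_name, sub_category, category, super_category)
    SELECT p.code, p.name, u.name, c.name, s.name
    FROM product AS p
    JOIN subcategory AS u ON u.code = p.subcategory_code
    JOIN category AS c ON c.code = u.category_code
    JOIN super_category AS s ON s.code = c.super_code
    ORDER BY p.code
""")
conn.commit()

dim_rows = conn.execute(
    "SELECT product_id, product_code, super_category "
    "FROM dim_product ORDER BY product_id"
).fetchall()
print(dim_rows)
```

The fact table can then refer to product_id rather than to the OLTP product code, which is why the dimension tables must be loaded before the fact table.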
Sales fact data are extracted as follows: first, compute the
average prime cost for each product sold in each day from
[Figure residue: counter dimension table of the data warehouse schema.
counters: counter-ID int <pk>, counter-code char(4), counter-name char varying(10), counter-type char(4), department-name char varying(10)]
RESULTS
A. Clustering of customer characteristics
A query of the 2003 sales data shows that only 15% of the
records carry customer information. Although this percentage
is small, the records are still reasonably representative.
The 2003 sales data are used for the customer analysis in
this paper. We therefore need a new sales fact table, named
2003_sales, which is loaded only with the 2003 records of the
sales fact table that have customer information. At
the same time, the age segment field of the
customer dimension table is assigned according to each
customer's birth year.
To cluster customer instances, the sales facts would
normally have to be aggregated by customer. We
can avoid this step by clustering on an OLAP multidimensional
data set, in which the gender and age segment
fields are set as two properties of the customer (identified
by card number). We create the data mining model in
Microsoft® Analysis Manager [2] as follows:
Data type: OLAP; Multidimensional data set: 2003_sales;
Data mining algorithm: Microsoft clustering; Dimension:
customer; Level: customer card number; Training data:
customer age segment, customer gender, sales amount of
money; Number of clusters: 10.
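For readers without an OLAP cube, the per-customer aggregation that the multidimensional data set makes unnecessary can be sketched as follows. The row layout, field names and sample values are hypothetical; the output has one row per customer with the three training attributes listed above.

```python
from collections import defaultdict


def aggregate_by_customer(fact_rows, customers):
    """Sum the sales amount per customer and attach customer properties.

    fact_rows: iterable of dicts with 'customer_id' and 'amount'.
    customers: dict mapping customer_id to a dict holding
               'gender' and 'age_segment'.
    """
    totals = defaultdict(int)
    for row in fact_rows:
        totals[row["customer_id"]] += row["amount"]
    return [
        {
            "customer_id": cid,
            "gender": customers[cid]["gender"],
            "age_segment": customers[cid]["age_segment"],
            "amount": total,
        }
        for cid, total in sorted(totals.items())
    ]


# Hypothetical sample data.
sample_customers = {
    1: {"gender": "F", "age_segment": "35-40"},
    2: {"gender": "M", "age_segment": "50+"},
}
sample_facts = [
    {"customer_id": 1, "amount": 120},
    {"customer_id": 2, "amount": 300},
    {"customer_id": 1, "amount": 80},
]
training_rows = aggregate_by_customer(sample_facts, sample_customers)
print(training_rows)
```

Each resulting row is one clustering instance, which corresponds to choosing the customer card number as the level of the mining model.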
The mining results show that, of the 6485 instances,
female customers account for 75%, customers above 35 years
of age account for 62%, and the average expenditure per
customer is 537 Yuan RMB. The main characteristics of the
customers in the 10 clusters are shown in Tab. 1.
TABLE 1
cluster
set as follows:
Data type: relational data; Fact table: 2003_sales_normal;
Dimension tables: customers, products; Mining algorithm:
Microsoft clustering; Key of instance: product ID; Training
data: customer age segment, customer gender, super
category of products, sales amount of money; Number of
clusters: 10.
TABLE 2
MAIN CHARACTERISTICS OF CUSTOMERS AND PRODUCT CATEGORIES
[Table residue: the cells of Tables 1 and 2 are interleaved in the source and cannot be reliably realigned. Legible header fragments: cluster; instance proportion (percentage); customer main age segments; customer main gender; main product; amount. Legible cell fragments: instance proportions between 2.5% and 14.8%; female-majority gender cells such as F (94%), F (98%), F (92%) and F (63%); product categories including cosmetics, daily use articles, spare time foods, candies & cakes, grains & oils, drinking powders and alcohol.]
TV. CONCLUSION
Clustering data mining groups instance data automatically
according to their similarities in order to reveal useful
patterns contained in the data. Enterprise managers can
obtain decision support by querying the mining results.
Clustering analysis provides an approximate understanding of
a problem and often points out other areas that need to be
investigated, so it is usually the first step of data
analysis. Customer analysis is clearly very important
for retail trade enterprises. This paper provides two
application examples of clustering in customer analyses in
REFERENCES
[1] Jiawei Han and Micheline Kamber, Data Mining: Concepts and
Techniques. San Francisco: Morgan Kaufmann Publishers, Inc., 2001.
[2] Tony Bain et al., Professional SQL Server 2000 Data Warehousing
with Analysis Services. Wrox Press, 2001.
[3] Paul S. Bradley, Usama M. Fayyad and Cory A. Reina. (October
1999). Scaling EM (Expectation Maximization) Clustering to Large
Databases.
Available:
f t p : / l f t p . ~ s ~ ~ h . ~ ~ ~ ~ ~ . ~ ~ ~ ~ ~ b / t r / t
[4] W. H. Inmon, Building the Data Warehouse. NJ: John Wiley & Sons,
Inc., 1996.
[5] R. Agrawal, T. Imielinski and A. Swami, Mining Association Rules
Between Sets of Items in Large Databases, Proc. ACM SIGMOD, 1993:
207-216.
[6] Shangwang Bai and Weichao Dang, PowerDesigner Software
Engineering Technology. Beijing: Publishing House of Electronics
Industry, 2004.