
Data Clustering for Heterogeneous Data

Teresa Gonçalves, Pedro Melgueira

Universidade de Évora
tcg@uevora.pt, m11153@alunos.uevora.pt

Filipe Clérigo, Ricardo Raminhos, Rui Estêvão

VIATECLA SA
clerigo@viatecla.com, rraminhos@viatecla.com, restevao@viatecla.com

Abstract. The continuous growth in the volume of information and data does not mean a proportional increase in the knowledge derived from it. Automatic analysis mechanisms based on Artificial Intelligence algorithms (supervised or not) can represent an important added value when they are naturally integrated in repositories specialized in managing large volumes of content (i.e. CMS, Content Management Systems).
Since some of these repositories are open, they give the organisations that use them a high level of flexibility, as business data structures can be modelled freely. However, this also means they are not restricted to a specific information domain, which poses a great challenge to the way data is interpreted and analysed by AI algorithms (the structures are not known beforehand), both for detecting contents that share similar group characteristics and for identifying the attributes that matter most for the analysis.
This is the main purpose of the SMART Content Provider prototype. The current paper presents the results obtained in the area of AI algorithms for clustering and attribute suggestion analysis, applied to open data repositories.

1 The SMART Content Provider (CP) Project

Through the Smart CP [2] project, research on enhancing intelligence in CMS environments was carried out under three main pillars:

- Enhanced mechanisms for aggregating heterogeneous information (where the structures and objects are not known beforehand);
- Definition of Artificial Intelligence algorithms, in particular in the area of pattern detection over semi-structured information;
- Mechanisms of data presentation applied to results and contents, exploring non-conventional formats and ways of representing information that contribute to a more fluid knowledge exploration.
The knowledge resulting from this research has been materialized in a prototype of a generic platform for data visualization and interaction, referred to as SMART Content Provider (CP), a project developed by VIATECLA [21], supported by Universidade de Évora [3] and GTE Consultores [7], and co-financed by QREN (Quadro de Referência Estratégico Nacional) [18].
The present paper focuses only on the second pillar of the project, related to the detection and suggestion of possible patterns present in the data by applying AI algorithms and evaluation heuristics. A general presentation of the project, in terms of its objectives, architecture and results, can be found in the paper SMART Content Provider [14], whilst a detailed presentation of the graphical components for data representation and exploration is available in the paper SMART Data Visualization and Exploration [15].
1.1 Architecture

Figure 1 shows a global vision of the SMART CP platform architecture. A three-colour scheme is used to characterize the functional blocks that compose the platform and its external interactions:

- Orange: blocks completely external to the platform, with which the SMART CP platform interacts to obtain data and contents;
- Green: functional blocks with which the SMART CP platform is integrated, i.e. the native content management system that supports the platform;
- Purple: native blocks of the SMART CP platform.
The architecture of the SMART CP platform follows a classic client / server paradigm, as presented in Figure 1. Blocks belonging to the server component are represented at the top of the image, and the client-related blocks at the bottom. Because the Smart CP platform uses data and contents present in content management systems, all client functional groups (i.e. data sorting, data visuals and exploration, accountability and workflows) are integrated in the content management system backoffice itself.
The SMART Analyser functional block is responsible for all AI-related data processing, analysis and suggestion. It is the main scope of the current paper, which presents its internal functioning and the approaches followed during its implementation.

Fig. 1. General diagram of the architecture of the platform SMART CP.

2 State of the Art - Clustering Algorithms and Feature Extraction

2.1 Clustering

Clustering is the process of finding groups of objects in a dataset. The clustering process creates groups such that the objects in a group are more similar to each other than to the objects in other groups. Clustering is usually applied to data that is not yet classified or divided in any way.
One of the first difficulties in clustering is finding the characteristics that best characterize the objects. A dataset may contain information that is simply not useful, or information that only becomes useful after some transformation. In any case, the features of the objects are taken in some numerical representation [24].
Formally, there is a dataset S and an object is a feature vector o ∈ S. For example, suppose a small dataset that stores bug reports, without much information, comprising 4 fields:

(Pr) Project;
(Re) Relevance;
(We) WeekDay;
(De) Description.

An object from this dataset would be represented by a vector

o = [Pr, Re, We, De].
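As a concrete illustration (not taken from the paper's dataset; all values are hypothetical), the sketch below shows one bug report first as a raw record and then as the numeric feature vector the algorithms work on after the pre-processing described in Section 4.1.

raw_record = {
    "Project": "Backoffice",       # Pr
    "Relevance": "Important",      # Re
    "WeekDay": "Tuesday",          # We
    "Description": "Login fails",  # De
}

# After mapping nominal values to integers (Section 4.1), the object becomes
# a plain feature vector o = [Pr, Re, We, De]; the integers below are made up.
o = [0, 1, 1, 1]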
2.2 Performance Measures

Since clustering is an unsupervised method, any performance evaluation must be done using the clustering model itself. Whether supervised or not, a performance measure for clustering looks at how similar the objects of one cluster are to each other and how dissimilar they are to the objects of other clusters. As a general rule, the objects of one cluster should be very similar to each other, while objects in different clusters should be very dissimilar.
For performance evaluation in unsupervised learning, a well-known measure called the Silhouette Coefficient [19] is used. This measure takes two variables, a and b, into account, both computed from the clustered data. The value of a is the mean distance between a sample object and all other objects in the same cluster. The value of b is the mean distance between that object and the objects in the other clusters.
Having these values, the following is calculated:

s = \frac{b - a}{\max(a, b)}.

The result of the Silhouette Coefficient, s, is a real value between -1 and 1. The closer the value is to 1, the better the clustering is. A value close to 0 means that the clusters overlap, while values close to -1 indicate that objects are mostly assigned to the wrong clusters.
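As a minimal sketch of how this measure can be computed in practice, SciKit-Learn [20] provides silhouette_score; X and labels below are toy placeholders for the encoded dataset and a cluster assignment.

import numpy as np
from sklearn.metrics import silhouette_score

X = np.array([[0, 1, 1], [0, 1, 2], [3, 0, 5], [3, 0, 4]])  # toy encoded objects
labels = np.array([0, 0, 1, 1])                             # toy cluster assignment

s = silhouette_score(X, labels)  # mean silhouette over all objects, in [-1, 1]
print(round(float(s), 3))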
2.3 K-Means and Variants

The first approach to many clustering tasks uses K-Means [13,22,1,11,17]. This algorithm has a very general definition and is the starting point for many different practices. The algorithm is parametrized with the number of clusters it should find, so that number might not be optimal.
Because the algorithm does not find the optimal number of clusters by itself, alternative methods must be used to determine it. One such method executes the same algorithm with many different parameters; different clusterings are obtained and the best one is chosen according to the performance measures. Other methods to find a better parametrization are discussed in the following sections.
K-Means relies on the concept of centroid. A centroid is an object with the same features as the objects in S. Given a subset Z ⊆ S, the centroid c(Z) is the object whose features are the average of all the objects in Z. Formally,

c(Z) = [c_0, \ldots, c_{m-1}], \qquad c_i(Z) = \frac{1}{n} \sum_{z \in Z} z_i,

where m is the number of features of the objects and n is the size of Z.
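A minimal sketch of the centroid computation with Numpy [16]; the matrix Z is a hypothetical subset of S with one row per object.

import numpy as np

Z = np.array([[1.0, 2.0, 0.0],
              [3.0, 4.0, 2.0],
              [5.0, 0.0, 4.0]])

c = Z.mean(axis=0)   # c_i(Z) = (1/n) * sum of the i-th feature over all z in Z
print(c)             # [3. 2. 2.]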


The algorithm progresses by updating the position of the centroids. The algorithm ends when the position of the centroids no longer changes from iteration
to iteration. In each iteration step, the proximity of each element to the centroid
is calculated. Proximity may vary from problem to problem. In general, the Euclidean Distance is used. Formally, for two arbitrary objects a and b, the distance
is defined as
dist(a, b) =

sX
(ai bi )2 .
i

The distance gives a sense of proximity between two objects: the closer two objects are, the smaller the distance. For a distance of 0, the objects are considered equal.
Other distances may be used, for example the Manhattan distance,

dist_1(a, b) = \|a - b\|_1 = \sum_i |a_i - b_i|.

Another very useful distance is the squared Euclidean distance,

dist^2(a, b) = \left( \sqrt{\sum_i (a_i - b_i)^2} \right)^2 = \sum_i (a_i - b_i)^2.

This distance is similar to the normal Euclidean distance but is computationally simpler, because it does not need any square root calculation. It is not a true metric, since it does not satisfy the triangle inequality, but it can be used as one.
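The three distances above can be written directly with Numpy; a small sketch with two hypothetical feature vectors follows.

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))  # dist(a, b)
manhattan = np.sum(np.abs(a - b))          # dist_1(a, b)
squared = np.sum((a - b) ** 2)             # dist^2(a, b), no square root needed

print(euclidean, manhattan, squared)       # 2.236..., 3.0, 5.0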
K-Means has an inconvenience regarding certain data types: it works for data which is nominal and sortable, and it must also be possible to apply arithmetic operations to the data. Data is nominal when its features can be distinguished in some way, i.e. two operators, = and ≠, may be defined:

a = b \iff \forall i : a_i = b_i,
a \neq b \iff \exists i : a_i \neq b_i.

The first states that a and b are equal because all of their features are equal; the second states that they are not equal because at least one of their features isn't.
Sortable data must also be nominal and it must be possible to define some order on it, for example if the data is lexicographically sortable or represents some kind of rank. Finally, data to which arithmetic operations can be applied is always numeric.
K-Modes [8,9] is a variant of the K-Means algorithm designed to deal with nominal data. The difference between the two algorithms lies in how the distance function is defined, which here is closer to a similarity function:

d(a, b) = \sum_i \delta(a_i, b_i), \qquad \delta(a_i, b_i) = \begin{cases} 1, & a_i = b_i \\ 0, & a_i \neq b_i \end{cases}.

Because the datasets in this project are almost entirely nominal, it only makes sense to work with K-Modes rather than K-Means.
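A minimal sketch of clustering nominal data with the kmodes package cited in [6]; the dataset and parameters below are illustrative, not the ones used in the prototype.

import numpy as np
from kmodes.kmodes import KModes

# Hypothetical encoded dataset: rows are objects, columns are nominal
# features already mapped to integers by the pre-processing phase.
X = np.array([[0, 2, 1],
              [0, 2, 0],
              [1, 0, 1],
              [1, 0, 0]])

km = KModes(n_clusters=2, init="Huang", n_init=5)
labels = km.fit_predict(X)            # cluster index for each object
print(labels, km.cluster_centroids_)  # assignments and the mode of each cluster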
2.4 Affinity Propagation

Affinity Propagation [5,23,10,12,4] is a clustering algorithm which does not require the number of clusters as an initial parameter. The algorithm needs a way to define distance or similarity between objects: for objects o_i, o_j, o_k ∈ S, if o_i is more similar to o_j than it is to o_k, then the following must hold,

s(o_i, o_j) > s(o_i, o_k).
The algorithm uses two matrices that are updated in each iteration step: the responsibility matrix R and the availability matrix A. A value R_{i,k} of R shows how well suited element k is to serve as the representative of element i, relative to the other candidate representatives. A value A_{i,k} of A states how appropriate it would be for i to pick k as its representative.
Each iteration updates both matrices until convergence. First, the responsibility matrix is updated using the rule

R_{i,k} = s(o_i, o_k) - \max_{k' \neq k} \{ A_{i,k'} + s(o_i, o_{k'}) \}.

Then, the following rules update the availability matrix,

p = R_{k,k} + \sum_{i' \notin \{i,k\}} \max(0, R_{i',k}),

A_{i,k} = \min(0, p), \quad i \neq k,

A_{k,k} = \sum_{i' \neq k} \max(0, R_{i',k}).
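A minimal sketch of running Affinity Propagation with SciKit-Learn [20]; the toy points are illustrative. With the default affinity, the negative squared Euclidean distance plays the role of s(o_i, o_k), and no number of clusters is passed in.

import numpy as np
from sklearn.cluster import AffinityPropagation

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])  # toy objects

ap = AffinityPropagation(random_state=0)
labels = ap.fit_predict(X)

n_clusters = len(ap.cluster_centers_indices_)  # number of exemplars found
print(labels, n_clusters)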

2.5 Joining Affinity Propagation and K-Modes

In this project, results from the two algorithms above are combined. The clustering process starts with the Affinity Propagation algorithm. Upon completion, its results show which objects in the data belong to which cluster and how many clusters were found by the algorithm itself.
The number of clusters calculated by Affinity Propagation not only gives a good estimate of the number of clusters in the data, but also gives a starting point for clustering with K-Modes. Supposing that Affinity Propagation yielded N clusters, K-Modes is run once for each number of clusters from N - I to N + I, with I being some positive integer value. This will, in turn, yield several clustering solutions. A final analysis is done over all of the results, and the clustering solution that performs best according to the Silhouette Coefficient is picked for further analysis.
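A sketch of this combined procedure, assuming the libraries mentioned in Section 4.2 are available; function and variable names are illustrative, not the prototype's own API.

import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.metrics import silhouette_score
from kmodes.kmodes import KModes

def cluster(X, I=2):
    ap = AffinityPropagation(random_state=0).fit(X)
    n = len(ap.cluster_centers_indices_)       # N proposed by Affinity Propagation

    best_labels, best_score = None, -1.0
    for k in range(max(2, n - I), n + I + 1):  # sweep from N - I to N + I
        labels = KModes(n_clusters=k, n_init=5).fit_predict(X)
        if len(set(labels)) < 2:               # silhouette needs at least 2 clusters
            continue
        score = silhouette_score(X, labels)
        if score > best_score:
            best_labels, best_score = labels, score
    return best_labels, best_score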
2.6 Feature Extraction

The end result of this project aims at displaying only two or three features and how the data is distributed between them. The final display arranges data by two features in a table-like fashion, with the two axes of the table set to two features. Each cell in the table contains several points which are randomly scattered across it. The points may have some colour, shape or size associated with them, so the table can display a third, fourth or fifth feature.
The clustering tasks will find clusters with more features than just two or three, so there must be a process that finds the most interesting groups of features to display to the end user. This process is called Feature Extraction.
The features that are extracted are the ones that yield the best distribution of the data. A good distribution means that those features alone are able to display distinct clusters in the data. A function is defined that conveys what a good distribution is, based on the notion of entropy.
A conditional probability distribution is defined of the form

P(C \mid F_1, F_2, \ldots, F_m),

where F_k is a feature, m is the total number of features and C is a cluster. The distribution states the probability of an object with the given features belonging to cluster C. The entropy of such a distribution, for any set of features, will be close to 0 if those features are representative of the clustering; hence, values closer to 0 are better. Entropy is therefore the heuristic used to find a good set of features.
The process that finds these distributions tries different sets of features and keeps the ones that perform best according to the heuristic.
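A sketch of this heuristic, assuming the clustered data sits in a pandas DataFrame with one column per feature plus a hypothetical "cluster" column; the score is the conditional entropy of the cluster label given a candidate feature pair, and lower is better.

from itertools import combinations
import numpy as np
import pandas as pd

def conditional_entropy(df, features, cluster_col="cluster"):
    """Weighted entropy of the cluster label given the values of `features`."""
    total = len(df)
    h = 0.0
    for _, group in df.groupby(list(features)):
        p = group[cluster_col].value_counts(normalize=True).values
        h += (len(group) / total) * -np.sum(p * np.log2(p))
    return h

def best_feature_pairs(df, feature_cols, top=5):
    """Return the `top` feature pairs whose entropy is closest to 0."""
    scored = [(conditional_entropy(df, pair), pair)
              for pair in combinations(feature_cols, 2)]
    return sorted(scored)[:top]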

3 Architecture

The high-level architectural abstraction of the SmartCP prototype is presented in Figure 2 and is based on the flow of four main blocks that work sequentially. Having a dataset made available by the CMS in JSON format as input, the pre-processing block is responsible for filtering and normalizing the data coming from external platforms. The following block, clustering, receives an already normalized output in a numeric CSV format (with all textual information mapped to numeric values, which are easier to process); its objective is to obtain aggregation sets. In the Extraction of Relevant Attributes block, the attributes that contribute most to a good distribution of content are determined, taking the previously found clusters into account. Finally, the output block, responsible for formatting the output data, yields the best clustering solution and the best attributes.

Fig. 2. High level view of the SMART CP prototype's architecture.

Given its complexity, the pre-processing phase can be broken down into 4 steps, as shown in Figure 3. In the first step, fields and data considered useless are removed. In the Enumeration Mapping step, nominal attributes are substituted by numeric ones; this conversion is necessary to ease the work of the clustering algorithms. The Date Handling step processes the fields containing dates, splitting them into year, month and day of the week. Finally, the data is normalized into a CSV format to be consumed by the following phases.

Fig. 3. Functional detail of the pre-processing block.

The clustering phase may also be broken down into steps (Figure 4). Initially, the Affinity Propagation algorithm is applied; it yields a clustering solution and the number of clusters N it found. The value of N is then used as the basis for clustering with the K-Modes algorithm, which is run several times with different variants of the proposed N.

Fig. 4. Functional detail of the clustering block.

Finally, the Evaluation block uses the performance measures to determine which are the best clustering solutions, and yields the clusters found.

4 Implementation

4.1 Pre-processing

The datasets used in this project are formatted as JSON files. Each JSON file is a list of objects and comes with a schema that states the data type of each field of the objects, along with other metadata which is not used in this project. The fields are the same for every object.
Nominal data tends to appear in string-like fields, such as comments and titles. Clustering is about finding similarities in data and, in this project, no effort is made to mine information from natural language; unless a text field is empty, almost every value of such a field would be unique and no similarities would be found. To avoid this problem, such fields are replaced by a boolean value which states whether the field is empty or not.
Some fields have small domains that are still text based. This is not exactly a problem for clustering, but comparing strings is computationally expensive for the algorithms. To simplify the process, such fields are mapped to integer values. For example, if there is a field called Importance whose domain is Not Important, Important, and Very Important, the values of the domain are mapped to 0, 1 and 2, respectively, as in the sketch below.
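A minimal sketch of this mapping step (function name and values are illustrative); the dictionary built along the way is the kind of information kept in the second output file of this phase, described further below.

def map_field(values):
    """Replace each nominal value by an integer, keeping the mapping."""
    mapping, encoded = {}, []
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)  # next free integer for an unseen value
        encoded.append(mapping[v])
    return encoded, mapping

encoded, mapping = map_field(["Not Important", "Important", "Very Important", "Important"])
print(encoded)   # [0, 1, 2, 1]
print(mapping)   # {'Not Important': 0, 'Important': 1, 'Very Important': 2}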

Dates pose a problem similar to that of string-like fields: their values are scattered through time and, if the full date is taken into account, they will almost always be dissimilar. To avoid this problem, dates are transformed into more useful information. This is currently done by keeping only the year, month, and day of the week in the dataset. With these alterations it becomes possible to detect similarities in terms of dates.
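A minimal sketch of this date transformation; the input format string is an assumption, since the actual CMS date format is not given in the paper.

from datetime import datetime

def split_date(value, fmt="%Y-%m-%dT%H:%M:%S"):  # assumed input format
    d = datetime.strptime(value, fmt)
    return {"Year": d.year, "Month": d.month, "WeekDay": d.weekday()}  # Monday = 0

print(split_date("2015-06-23T14:02:11"))
# {'Year': 2015, 'Month': 6, 'WeekDay': 1}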
The output of this phase is a CSV file. This format is used because it is more or less the norm for Machine Learning and Data Mining algorithms and because most algorithm implementations are ready to consume it. The file has a header with the name of each field, followed by one line per object. A second file is also produced in this phase, containing the dictionaries used in the mapping process.
This whole process is implemented as a Python 3 script, with no third-party libraries.
4.2 Clustering and Feature Extraction

Clustering is done with two algorithms: K-Modes, the variant of K-Means, and Affinity Propagation. As described in Section 2.5, the clustering process starts with the Affinity Propagation algorithm, which takes the pre-processed dataset as input and yields a mapping of objects to the calculated clusters, together with the number of clusters it found.
That number of clusters is then used to parametrize the execution of the K-Modes algorithm: if Affinity Propagation found N clusters, K-Modes is executed for every number of clusters from N - I to N + I.
At the end of the clustering part there is one clustering model fitted with Affinity Propagation and several models fitted with different K-Modes parametrizations.
Feature extraction is done by testing different feature groups with the heuristic function; the five groups that perform best are kept.
The implementation used for K-Modes is [6], a Python library built on top of Numpy [16] and distributed under the MIT license. The implementation of Affinity Propagation comes from the SciKit-Learn Python library [20].
4.3 Output

There are two outputs to take into account: the clustering output and the feature extraction output. The clustering output states which objects belong to which cluster; a cluster is identified by a non-negative integer, while objects are identified by their id. This output is a JSON document with the format shown in Figure 5.
The output of the feature extraction part is a list of groups, each containing a list of fields of the dataset. These groups state that the fields in them were considered interesting for displaying, according to the definitions given. Figure 6 shows an example of this JSON.

The JSON format is used for interoperability reasons between the various
products and applications that are used with this project.
[
{"Cluster": 0, "Ids": [...]},
{"Cluster": 1, "Ids": [...]},
...
{"Cluster": n, "Ids": [...]}
]

Fig. 5. JSON for clustering result.

[
["Importance", "WeekDay"],
["Importance", "Project"],
...
["Project", "Description"]
]

Fig. 6. JSON for feature extraction.
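A small sketch of how a cluster label array can be serialized into the clustering output of Figure 5; the object ids below are illustrative.

import json

def clusters_to_json(ids, labels):
    """Group object ids by cluster label and build the Figure 5 structure."""
    groups = {}
    for obj_id, label in zip(ids, labels):
        groups.setdefault(int(label), []).append(obj_id)
    return [{"Cluster": c, "Ids": groups[c]} for c in sorted(groups)]

print(json.dumps(clusters_to_json(["a1", "a2", "b7"], [0, 0, 1])))
# [{"Cluster": 0, "Ids": ["a1", "a2"]}, {"Cluster": 1, "Ids": ["b7"]}]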

5 Future Work

With the clustering implementation complete, further experiments must be carried out to assess its correctness and usefulness. These experiments will make it possible to observe the performance of both algorithms and of the combined method discussed above. Regarding feature extraction, a method to search for the optimal features will be developed, based on state space search with heuristics. Finally, it is fundamental for the project to be tested with wider datasets of a different nature; these experiments will allow observing the overall performance of the developed system.

References
1. P. S. Bradley and Usama M. Fayyad. Refining initial points for k-means clustering. Pages 91–99. Morgan Kaufmann, 1998.
2. Microsite SMART CP. http://www.viatecla.com/inovacao/smart content provider. 2015.
3. Universidade de Évora. http://www.uevora.pt/. 2015.
4. Delbert Dueck and Brendan J. Frey. Non-metric affinity propagation for unsupervised image categorization.
5. Brendan J. Frey and Delbert Dueck. Clustering by passing messages between data points. Science, 315:2007, 2007.
6. K-Modes GitHub. https://github.com/nicodv/kmodes. 2015.
7. GTE. http://www.gte.pt/. 2015.
8. Zhexue Huang. Clustering large data sets with mixed numeric and categorical values. In The First Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 21–34, 1997.
9. Zhexue Huang. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Min. Knowl. Discov., 2(3):283–304, September 1998.
10. Tao Li. A general model for clustering binary data. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD '05, pages 188–197, New York, NY, USA, 2005. ACM.
11. Aristidis Likas, Nikos A. Vlassis, and Jakob J. Verbeek. The global k-means clustering algorithm. Pattern Recognition, 36(2):451–461, 2003.
12. Zhengdong Lu and M. A. Carreira-Perpinan. Constrained spectral clustering through affinity propagation. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8, June 2008.
13. J. MacQueen. Some methods for classification and analysis of multivariate observations. In 5th Berkeley Symposium on Mathematical Statistics and Probability, pages 281–297, 1967.
14. Filipe Clérigo, Ricardo Raminhos, Rui Estêvão, Teresa Gonçalves, and Pedro Melgueira. Smart content provider. 2015.
15. Filipe Clérigo, Ricardo Raminhos, Rui Estêvão, Teresa Gonçalves, and Pedro Melgueira. Smart data visualization and exploration. 2015.
16. Numpy. http://www.numpy.org/. 2015.
17. Dan Pelleg and Andrew W. Moore. X-means: Extending k-means with efficient estimation of the number of clusters. In Proceedings of the Seventeenth International Conference on Machine Learning, ICML '00, pages 727–734, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.
18. QREN. http://www.qren.pt/np4/home. 2015.
19. Peter J. Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65, 1987.
20. SciKit-Learn. http://scikit-learn.org/. 2015.
21. Site Institucional. VIATECLA. http://www.viatecla.com. 2015.
22. Kiri Wagstaff, Claire Cardie, Seth Rogers, and Stefan Schroedl. Constrained k-means clustering with background knowledge. In ICML, pages 577–584. Morgan Kaufmann, 2001.
23. Kaijun Wang, Junying Zhang, Dan Li, Xinna Zhang, and Tao Guo. Adaptive affinity propagation clustering. CoRR, abs/0805.1096, 2008.
24. Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2005.
