Universidade de Évora
tcg@uevora.pt, m11153@alunos.uevora.pt
Architecture
Figure 1.1 shows a global vision of the SMART CP architectural platform. A three-colour scheme is used to characterize the functional blocks that compose the platform and its external interactions:

Orange: Blocks completely external to the platform, with which the SMART CP platform interacts to obtain data / contents;

Green: Functional blocks with which the SMART CP platform is integrated, i.e. the native content management system that supports the platform;

Purple: Native blocks of the SMART CP platform.
The architecture of the SMART CP platform follows a classic client / server paradigm, as presented in Figure 1.1. Blocks belonging to the server component are represented at the top of the image, and the client-related components at the bottom. Because the SMART CP platform uses data / contents present in content management systems, all client functional groups (i.e. data sorting, data visuals and exploration, accountability and workflows) are integrated in the content management system backoffice itself.
The functional block SMART Analyser is responsible for all AI-related data processing, analysis and suggestion. It constitutes the main scope of the current paper, which presents its internal functioning and the approaches followed during its implementation.
2 Clustering and Feature Extraction

2.1 Clustering
Clustering is the process of finding groups of objects in a dataset. The clustering process creates groups such that the objects in a group are more similar to each other than to the objects in other groups. Clustering is usually applied to data that is not yet classified or divided in any way.

One of the first difficulties in clustering is finding the features that best characterize the objects. It might happen that a dataset contains information that is simply not useful, or information which is only useful after some transformation. Regardless, the features of the objects are taken in some numerical representation [24].
Formally, there is a dataset $S$. An object is a feature vector $o \in S$. For example, suppose a small dataset that stores bug reports, without much information, comprising 4 fields:

(Pr) Project;
(Re) Relevance;
(We) WeekDay;
(De) Description.
2.2 Performance Measures

A widely used measure is the Silhouette Coefficient [19]. For a given object, let $a$ be the mean distance to the other objects of its own cluster, and $b$ the mean distance to the objects of the nearest other cluster. The silhouette of the object is then

\[
s = \frac{b - a}{\max(a, b)}.
\]

Values close to 1 indicate a well-placed object, while values close to $-1$ indicate a probable misassignment.
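For instance, an object whose mean intra-cluster distance is $a = 2$ and whose mean distance to the nearest other cluster is $b = 5$ gets

\[
s = \frac{5 - 2}{\max(2, 5)} = 0.6,
\]

a value indicating a reasonably well-placed object.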
2.3 K-Means

The first approach to many clustering tasks uses K-Means [13,22,1,11,17]. This algorithm has a very general definition and is the starting point for several different practices. The algorithm is parametrized with the number of clusters it should find, so that number might not be optimal.

Because the algorithm doesn't find the optimal number of clusters, alternative methods must be used to find that number. One such method consists of executing the same algorithm with many different parameters. This yields different clusterings, from which the best one can be picked according to the performance measures. Other methods to find a better parametrization are discussed in the following sections.
K-Means relies on the concept of a centroid. A centroid is an object which has the same features as the objects in $S$. Given a subset $Z \subseteq S$, the centroid $c(Z)$ is the object whose features are the average of those of all the objects in $Z$. Formally,

\[
c(Z) = [c_0, \ldots, c_{m-1}],
\qquad
c_i(Z) = \frac{1}{n} \sum_{z \in Z} z_i,
\]

where $n = |Z|$. The algorithm also requires a distance between objects, typically the Euclidean distance,

\[
\mathit{dist}_2(a, b) = \lVert a - b \rVert_2 = \sqrt{\sum_i (a_i - b_i)^2}.
\]
The distance gives a sense of proximity between two objects: the closer two objects are, the smaller the value of the distance. For a distance of 0, the objects are considered equal.
Other distances may be used, for example the Manhattan distance,

\[
\mathit{dist}_1(a, b) = \lVert a - b \rVert_1 = \sum_i |a_i - b_i|,
\]

or the squared Euclidean distance,

\[
\mathit{dist}_2^2(a, b) = \left( \sqrt{\sum_i (a_i - b_i)^2} \right)^2 = \sum_i (a_i - b_i)^2.
\]
This last distance is similar to the normal Euclidean distance, but it is computationally simpler because it doesn't need any square root calculation. It is not a real metric, because it doesn't follow the triangle inequality, but it can be used as such.
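As a minimal illustration, the three distances can be written directly from their definitions (plain Python; the function names are ours, not from the project code):

import math

def dist_euclidean(a, b):
    # dist_2(a, b) = sqrt(sum_i (a_i - b_i)^2)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dist_manhattan(a, b):
    # dist_1(a, b) = sum_i |a_i - b_i|
    return sum(abs(x - y) for x, y in zip(a, b))

def dist_squared(a, b):
    # Squared Euclidean distance: no square root, not a true metric.
    return sum((x - y) ** 2 for x, y in zip(a, b))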
K-Means has an inconvenience regarding certain data types. The algorithm works for data which is nominal, sortable, and on which arithmetic operations can be applied. Data is nominal when its features may be distinguished in some way, i.e. when two operators, $=$ and $\neq$, may be defined:

\[
a = b \iff \forall i,\; a_i = b_i,
\qquad
a \neq b \iff \exists i,\; a_i \neq b_i.
\]
The first states that $a$ and $b$ are equal because all of their features are equal. The second states that they are not equal because at least one of their features isn't.
Sortable data must also be nominal, and it must be possible to define some order over it, for example if the data is lexicographically sortable or if it represents some kind of rank. Finally, data on which arithmetic operations may be applied is always numeric.
K-Modes [8,9] is a variant of the K-Means algorithm made to deal with nominal data. The difference between the two algorithms lies in how the distance function is defined, which here is more of a similarity function:

\[
d(a, b) = \sum_i \delta(a_i, b_i),
\qquad
\delta(a_i, b_i) =
\begin{cases}
1, & a_i = b_i \\
0, & a_i \neq b_i
\end{cases}.
\]
Because the datasets in this project are almost entirely nominal, it only makes
sense to work with K-Modes and not with K-Means.
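As a minimal illustration, this matching measure can be written directly from its definition (function names are ours, not from the project code):

def delta(ai, bi):
    # delta(a_i, b_i) = 1 when the features match, 0 otherwise.
    return 1 if ai == bi else 0

def similarity(a, b):
    # d(a, b): the number of features on which a and b agree.
    return sum(delta(ai, bi) for ai, bi in zip(a, b))

# Two bug reports agreeing on Project and WeekDay but not Relevance:
print(similarity(("CMS", "High", "Monday"), ("CMS", "Low", "Monday")))  # 2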
2.4 Affinity Propagation

Affinity Propagation [5] finds clusters by exchanging messages between pairs of data points until a set of exemplars emerges; the number of clusters is not given a priori but determined by the algorithm itself. Two kinds of messages are exchanged: responsibilities $R_{i,k}$, which indicate how well point $k$ is suited to serve as the exemplar of point $i$, and availabilities $A_{i,k}$, which indicate how appropriate it would be for point $i$ to choose point $k$ as its exemplar. The availabilities are updated as

\[
A_{i,k} = \min\Bigl(0,\; R_{k,k} + \sum_{i' \notin \{i,k\}} \max(0, R_{i',k})\Bigr),
\quad i \neq k,
\]
\[
A_{k,k} = \sum_{i' \neq k} \max(0, R_{i',k}).
\]
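In practice the message passing does not have to be implemented by hand. A minimal sketch using the scikit-learn implementation [20] (the toy data and variable names are illustrative, not from the project):

import numpy as np
from sklearn.cluster import AffinityPropagation

# Toy numeric dataset: one row per object, one column per (mapped) feature.
X = np.array([[0, 1, 2],
              [0, 1, 1],
              [5, 4, 4],
              [5, 5, 4]])

ap = AffinityPropagation(damping=0.5)
labels = ap.fit_predict(X)                      # cluster index per object
n_clusters = len(ap.cluster_centers_indices_)   # N, chosen by the algorithm
print(labels, n_clusters)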
2.5 Clustering Process
In this project, results from the two mentioned algorithms are used. The clustering process starts with the Affinity Propagation algorithm. Upon completion, the available results show which objects in the data correspond to which cluster, and how many clusters were found by the algorithm itself.

Having the number of clusters as calculated by Affinity Propagation not only gives a good estimate of the number of clusters in the data, but also gives a starting point for clustering with K-Modes. Supposing that Affinity Propagation yielded $N$ clusters, K-Modes is then performed with $N - I, N - I + 1, \ldots, N + I$ clusters, with $I$ being some positive integer value. Each run yields a clustering solution. A final analysis is done over all of the results: the clustering solution that performs best, according to the Silhouette Coefficient performance measure, is picked for further analysis.
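The following is a minimal sketch of this combined process, assuming the kmodes package [6] and scikit-learn [20] mentioned in the implementation section; the Hamming metric stands in for the matching measure when computing the silhouette, and all names are illustrative:

from sklearn.cluster import AffinityPropagation
from sklearn.metrics import silhouette_score
from kmodes.kmodes import KModes

def best_clustering(X, I=2):
    # Step 1: Affinity Propagation determines a number of clusters N.
    ap = AffinityPropagation().fit(X)
    N = len(ap.cluster_centers_indices_)

    # Step 2: run K-Modes for k = N - I, ..., N + I and keep the solution
    # with the highest Silhouette Coefficient over the nominal features.
    best_labels, best_score = None, -1.0
    for k in range(max(2, N - I), N + I + 1):
        labels = KModes(n_clusters=k, init='Huang', n_init=5).fit_predict(X)
        score = silhouette_score(X, labels, metric='hamming')
        if score > best_score:
            best_labels, best_score = labels, score
    return best_labels, best_score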
2.6 Feature Extraction
The end result of this project aims at displaying only two or three features and how the data is distributed between those features. The final display arranges data by two features in a table-like fashion, with the two axes of the table assigned to two features. Each cell of the table holds several points which are randomly scattered across it. The points may have some colour, form, or size associated with them, so the table can display a third, fourth, or fifth feature.
The clustering tasks will find clusters with more features than just two or three, so there needs to be a process that finds the most interesting groups of features to display to the end user. This process is called Feature Extraction.
The features that are extracted are the ones that have the best distribution
of data. Having a good distribution of data means that those features alone are
able to display distinct clusters in the data. A function is defined that conveys
what a good distribution is. The function is based on the notion of entropy.
A conditional probability distribution is defined of the form

\[
P(C \mid F_1, F_2, \ldots, F_m),
\]

where $F_k$ is a feature, $m$ is the total number of features, and $C$ is a cluster. The distribution states the probability of an object with the given feature values belonging to cluster $C$. The entropy of such a distribution, for any set of features, will be close to 0 if those features are representative of the clustering distribution. Hence values closer to 0 are better, and entropy is the heuristic used when searching for a good set of features.
The process that finds these distributions tries different sets of features and keeps the ones that perform best, according to the heuristic.
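A minimal sketch of such a search, scoring pairs of features by the empirical conditional entropy $H(C \mid F_i, F_j)$ and keeping the lowest-scoring pairs (plain Python; all names are illustrative):

import itertools
import math
from collections import Counter

def conditional_entropy(features, clusters):
    # H(C | F) over the empirical joint distribution, where each entry of
    # `features` is the tuple of selected feature values for one object.
    n = len(clusters)
    joint = Counter(zip(features, clusters))   # counts of (f, c) pairs
    marginal = Counter(features)               # counts of f alone
    return -sum(count / n * math.log2(count / marginal[f])
                for (f, c), count in joint.items())

def best_feature_pairs(data, clusters, keep=5):
    # Try every pair of columns; keep the pairs with entropy closest to 0.
    n_features = len(data[0])
    scored = []
    for i, j in itertools.combinations(range(n_features), 2):
        cols = [(row[i], row[j]) for row in data]
        scored.append(((i, j), conditional_entropy(cols, clusters)))
    return sorted(scored, key=lambda pair: pair[1])[:keep]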
3 Architecture
The high-level architectural abstraction of the SmartCP prototype is presented in Figure 3; it is based on a flow of four main blocks that work sequentially. Having a dataset made available by the CMS in JSON format as input, the pre-processing block starts and is responsible for filtering and normalizing data from external platforms. The following block, clustering, receives an already normalized output in a numeric CSV format (having all of the textual information mapped to numeric values, which are easier to process). This block has the objective of obtaining aggregation sets. In the Extraction of Relevant Attributes block, the attributes which contribute most to a good distribution of content are determined, taking the previously found clusters into account. Finally, the output block, which is responsible for formatting the output data, yields the best clustering solution and the best attributes.
The clustering phase may also be broken down into steps (Figure 5). Initially, the Affinity Propagation algorithm is applied. This algorithm yields a clustering solution and the number of clusters $N$ it found. The value of $N$ is later used as the basis for clustering with the K-Modes algorithm, which is run several times with different variants of the proposed $N$.
4 Implementation

4.1 Pre-processing
The datasets used in this project are formatted as JSON files. Each JSON file is a list of objects. Along with this file there is a schema that states the data type of each field of the objects, along with other metadata which is not used in this project. The fields are the same for every object.
Nominal data tends to have string-like data types, such as comments and titles. Clustering is about finding similarities in data, and in this project no effort is made to mine information from natural language; unless a text field is empty, nearly every value of such a field will be unique in the dataset, so no similarities would be found. To avoid this problem, such fields are replaced by a boolean value which states what is empty and what is not.
Some fields have small domains that are still text based. This is not exactly a problem for clustering; however, comparing strings is computationally more expensive for the algorithms. To simplify the process, such fields are mapped to integer values. For example, if there is a field called Importance whose domain is Not Important, Important, and Very Important, then the values of the domain are mapped to 0, 1, and 2, respectively.
Dates have a problem similar to that of string-like fields. They will have very diverse values scattered through time, and, if the whole date is taken into account, those values will always be dissimilar. To avoid this problem, dates are transformed into more useful information. This is currently done by keeping only the year, month, and day of the week in the dataset. With these alterations it is possible to find similarities in terms of dates.
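A minimal sketch of these transformations, using only the Python 3 standard library in line with the implementation described below; the record layout, the Date field, and the helper names are assumptions for illustration:

import datetime

# Small textual domain mapped to integers (as in the Importance example).
IMPORTANCE = {"Not Important": 0, "Important": 1, "Very Important": 2}

def preprocess(record):
    # Free-text field: keep only whether it is empty or not.
    has_description = 1 if record.get("Description") else 0

    # Small textual domain: map each value to an integer.
    importance = IMPORTANCE[record["Importance"]]

    # Date field (assumed ISO format): keep year, month, and weekday only.
    date = datetime.date.fromisoformat(record["Date"])

    return {
        "Description": has_description,
        "Importance": importance,
        "Year": date.year,
        "Month": date.month,
        "WeekDay": date.weekday(),
    }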
The output of this phase is a CSV file. This format is used because it is more or less the norm for Machine Learning and Data Mining algorithms, and because most algorithm implementations are ready to consume it. The file has a header with the name of each field, followed by one line per object. This phase also outputs a second file containing the dictionaries used in the mapping process.
This whole process is done using a Python 3 script. No third party libraries
were used.
4.2 Clustering

Clustering is done using two algorithms: K-Modes, the variant of K-Means presented earlier, and Affinity Propagation. As described in Section 2.5, the clustering process starts with the Affinity Propagation algorithm, which takes the pre-processed dataset as input and yields a relation of objects to calculated clusters, as well as the number of clusters it found.
The number of clusters is then used to parametrize the execution of the K-Modes algorithm: if Affinity Propagation found $N$ clusters, K-Modes will be executed for $N - I, N - I + 1, \ldots, N + I$ clusters.
At the end of the clustering part there is one clustering model fitted using Affinity Propagation and several models fitted using different K-Modes parametrizations.

Feature extraction is done by testing different feature groups with the heuristic function. The five groups that perform best are the ones that are kept.
The implementation used for K-Modes is [6], a Python library built on top of NumPy [16] and distributed under the MIT license. The implementation of Affinity Propagation comes from the SciKit-Learn Python library [20].
4.3 Output
There are two outputs to take into account: the clustering output and the feature extraction output. The clustering output states which objects belong to which cluster. A cluster is identified by a positive integer, while the objects are identified by their id. The output consists of a JSON file with the format shown in Figure 5.
The output of the feature extraction part is a list of groups, each of which in turn contains a list of fields of the dataset. These groups state that the fields in them were considered interesting for display, according to the definitions given. Figure 6 shows an example of this JSON.
The JSON format is used for interoperability between the various products and applications used with this project.
[
{"Cluster": 0, "Ids": [...]},
{"Cluster": 1, "Ids": [...]},
...
{"Cluster": n, "Ids": [...]}
]
[
["Importance", "WeekDay"],
["Importance", "Project"],
...
["Project", "Description"]
]
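A minimal sketch of how these two files could be serialized with the standard json module (file and variable names are illustrative):

import json

def write_outputs(labels, ids, feature_groups):
    # Clustering output: one entry per cluster with the ids of its objects.
    clusters = {}
    for obj_id, label in zip(ids, labels):
        clusters.setdefault(int(label), []).append(obj_id)
    clustering = [{"Cluster": c, "Ids": members}
                  for c, members in sorted(clusters.items())]
    with open("clusters.json", "w") as f:
        json.dump(clustering, f, indent=2)

    # Feature extraction output: the selected groups of field names.
    with open("features.json", "w") as f:
        json.dump(feature_groups, f, indent=2)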
5 Future Work
References
1. P. S. Bradley and Usama M. Fayyad. Refining initial points for k-means clustering. Pages 91–99. Morgan Kaufmann, 1998.
3. U. de Évora. http://www.uevora.pt/. 2015.
4. Delbert Dueck and Brendan J. Frey. Non-metric affinity propagation for unsupervised image categorization.
5. Brendan J. Frey and Delbert Dueck. Clustering by passing messages between data points. Science, 315(5814):972–976, 2007.
6. K-Modes GitHub. https://github.com/nicodv/kmodes. 2015.
7. GTE. http://www.gte.pt/. 2015.
8. Zhexue Huang. Clustering large data sets with mixed numeric and categorical values. In The First Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 21–34, 1997.
9. Zhexue Huang. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Min. Knowl. Discov., 2(3):283–304, September 1998.
10. Tao Li. A general model for clustering binary data. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD '05, pages 188–197, New York, NY, USA, 2005. ACM.
11. Aristidis Likas, Nikos A. Vlassis, and Jakob J. Verbeek. The global k-means clustering algorithm. Pattern Recognition, 36(2):451–461, 2003.
12. Zhengdong Lu and M. A. Carreira-Perpinan. Constrained spectral clustering through affinity propagation. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8, June 2008.
13. J. MacQueen. Some methods for classification and analysis of multivariate observations. In 5th Berkeley Symposium on Mathematical Statistics and Probability, pages 281–297, 1967.
14. Filipe Clerigo, Ricardo Raminhos, Rui Estevão, Teresa Gonçalves, and Pedro Melgueira. Smart content provider. 2015.
15. Filipe Clerigo, Ricardo Raminhos, Rui Estevão, Teresa Gonçalves, and Pedro Melgueira. Smart data visualization and exploration. 2015.
16. NumPy. http://www.numpy.org/. 2015.
17. Dan Pelleg and Andrew W. Moore. X-means: Extending k-means with efficient estimation of the number of clusters. In Proceedings of the Seventeenth International Conference on Machine Learning, ICML '00, pages 727–734, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.
18. QREN. http://www.qren.pt/np4/home. 2015.
19. Peter J. Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65, 1987.
20. SciKit-Learn. http://scikit-learn.org/. 2015.
21. VIATECLA. Site Institucional. http://www.viatecla.com. 2015.
22. Kiri Wagstaff, Claire Cardie, Seth Rogers, and Stefan Schroedl. Constrained k-means clustering with background knowledge. In ICML, pages 577–584. Morgan Kaufmann, 2001.
23. Kaijun Wang, Junying Zhang, Dan Li, Xinna Zhang, and Tao Guo. Adaptive affinity propagation clustering. CoRR, abs/0805.1096, 2008.
24. Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2005.