2017/2018
Data Mining
Project
Group Elements
Table of Contents
Group Elements
Executive Summary
Introduction
Company
Tools and Methodology
Variable Analysis
Complexity Variable
Priority Variable
NumOperators Variable
NumUpdates Variable
OpenType Variable
Data Preparation
PCA
Pre-Processing
Duplicate Data
Outliers
Missing Values
Variables Selection
Segmentation
Conclusion
Clusters
Cluster 1 - Quick Resolutions Tickets
Cluster 2 - Multi-Technology Tickets
Cluster 3 - Automatic Tickets
Annex
Executive Summary
We were asked to analyse all the backlog information about tickets opened across the clients'
infrastructures, in order to understand the behaviour of these requests and then decide how to
improve the current process, making it more efficient and better matched to the customers' needs.
Introduction
The Service Manager is an application used to manage all opened tickets. The four categories
of tickets are listed below.
1 - Change: this category implies a modification of a Configuration Item (CI) and requires
approval from the customers;
2 - Task (or Change Task): as its name suggests, it is used when several changes, by different
teams, have to be done before the main goal is reached;
3 - Incident: a ticket created due to an interruption of service or some kind of delay which
could lead to a very poor quality of service of a CI. In general it is detected automatically and an
agent opens a ticket. It has high priority due to the SLA;
4 - Problem: this type is created after previous incidents have been triggered without a
consistent resolution being provided. It has a Moderate to High priority.
Company
The company, XPTO IT Consulting, is a global company in the IT area with a branch in
Portugal. It has a hundred employees, 40 large companies as customers and an annual revenue
of around 2 million dollars. It is a real company with real data but, due to an N.D.A., its name
was changed to avoid legal problems.
Tools and Methodology
Analysing the data provided by XPTO IT Consulting requires a lot of effort when an unsuitable
tool, such as Excel, is used. Although the data was delivered in a CSV file, Excel could not
support an appropriate analysis. Because of this, we were pushed towards advanced analysis
tools such as SAS, Python or R. Since we had already used SAS in a previous project, we
decided to try R this time. So, we used R (version 3.4.3) and RStudio (1.1.383).
After data cleansing, the analysis was based on five main variables: Complexity, Priority,
Number of Operators, Number of Updates and Open Type.
We used the K-means strategy and the Elbow graph in order to determine the number of clusters.
Variable Analysis
Complexity Variable
This variable defines how complex the ticket task is. Its value is of integer type.
See graph 1 in the Annex section to understand its distribution.
Priority Variable
This variable defines how fast the ticket should be resolved. Its value is of integer type.
See graph 2 in the Annex section to understand its distribution.
NumOperators Variable
This variable shows the number of operators who have worked on each ticket. Its value is of
integer type. See graph 3 in the Annex section to understand its distribution.
NumUpdates Variable
This variable shows the number of updates made on each ticket. Its value is of integer type.
See graph 4 in the Annex section to understand its distribution.
OpenType Variable
This variable shows how a ticket was opened: type 0 was opened by a person and type 1 was
opened automatically by a monitoring agent running on the servers. Its value is of integer
type. See graph 5 in the Annex section to understand its distribution.
Data Preparation
PCA
PCA (Principal Component Analysis) was not used, due to the small number of useful variables
in the provided data; no further feature selection or dimensionality reduction was needed.
Pre-Processing
Duplicate Data
After an initial analysis using Microsoft SQL Server, all duplicate records were deleted.
Outliers
Complexity variable - The data is concentrated mainly in types 0, 3 and 4, so types 1 and 2
were considered outliers and removed from the dataset in Excel.
Priority variable - The data is concentrated mainly in types 2, 3 and 4, so types 0 and 1 were
considered outliers and removed from the dataset in Excel.
NumOperators variable - Records with seven (or more) operators were considered outliers and
removed from the dataset in Excel.
NumUpdates variable - Records with eighteen (or more) updates were considered outliers and
removed from the dataset in Excel.
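The same filters can be reproduced in R instead of Excel. The sketch below assumes a data frame named source_data with the four columns discussed above; the values are toy stand-ins for the real backlog:

```r
# Toy stand-in for the real backlog (same column names as in this report)
source_data <- data.frame(
  Complexity   = c(0, 3, 4, 1, 2, 3),
  Priority     = c(2, 3, 4, 0, 1, 3),
  NumOperators = c(1, 2, 7, 3, 8, 1),
  NumUpdates   = c(1, 5, 18, 2, 30, 4)
)

# Apply all four outlier rules at once
clean <- subset(source_data,
                Complexity   %in% c(0, 3, 4) &  # drop Complexity types 1 and 2
                Priority     %in% c(2, 3, 4) &  # drop Priority types 0 and 1
                NumOperators < 7 &              # drop seven or more operators
                NumUpdates   < 18)              # drop eighteen or more updates

nrow(clean)  # rows surviving the filters: 3
```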
Missing Values
All missing values were removed in R Studio. See script 4 in Annex section.
Variables Selection
Variables Excluded
Some variables were excluded because they do not affect the cluster analysis, e.g. some
date-related ones such as “Duration in Days” and “Days without Update”, and a boolean variable
called “CustomVisible”.
Correlated Variables
According to the Pearson coefficients (table 1, in the Annex section), there are no redundant
variables, since no pair reaches 0.89. Thus, we decided to keep those five variables.
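This redundancy check can be sketched in R with cor(); the variable names come from this report, while the values below are toy stand-ins for the real backlog:

```r
# Toy stand-in for the five retained variables
vars <- data.frame(
  Complexity   = c(0, 3, 4, 3, 0, 4),
  Priority     = c(2, 4, 3, 2, 4, 3),
  NumOperators = c(1, 3, 2, 1, 2, 3),
  NumUpdates   = c(1, 5, 12, 3, 6, 9),
  OpenType     = c(0, 1, 0, 1, 0, 1)
)

r <- cor(vars, method = "pearson")  # 5 x 5 Pearson correlation matrix
off_diag <- r[upper.tri(r)]         # the ten pairwise coefficients
any(abs(off_diag) >= 0.89)          # TRUE would flag a redundant pair
```

With the real data, a TRUE here would justify dropping one variable of each flagged pair before clustering.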
Segmentation
In order to identify business behaviours and establish a well-defined administration strategy
accordingly, prior knowledge of the existing customers' ticket profiles is needed. Therefore, a
main segmentation was implemented, based on the customer ticket life cycle. Thus, tickets were
grouped into different clusters, each of them reflecting, as much as possible, common ticket
characteristics under the defined segmentation.
The clustering method used was K-means, due to its generally better performance when
compared to the Hierarchical Clustering technique. The K-means method repeats cycles of
centroid recalculation until the best centroids are found, while Hierarchical Clustering uses a
series of divisions in which elements are grouped and ungrouped according to their
characteristics and, in the end, the result is presented as a dendrogram.
Furthermore, the best number of clusters to be considered, which should balance complexity
against descriptive/discriminative ability in a strictly descriptive analysis, was determined
using the Elbow graph rule, whose goal is to find the minimum number of necessary clusters
within each segmentation dataset.
The Hubert and D indexes were used in order to confirm the Elbow graph choice. (See graph
7 in the Annex section.)
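The Elbow rule itself can be sketched in a few lines of R: run kmeans for increasing k and plot the total within-cluster sum of squares; the "elbow" is where the curve stops dropping sharply. Toy two-cluster data stands in for the real backlog here:

```r
# Two well-separated toy clusters in 2-D
set.seed(123)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 5), ncol = 2))

# Total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)

plot(1:6, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster SS")  # the bend marks the elbow
```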
Conclusion
Clusters
After calculating the defined number of clusters, we obtained the following centroids:
Cluster 1 - Quick Resolutions Tickets
This cluster presents tickets with small numbers of interactions and small numbers of updates.
These characteristics lead us to conclude that these are “quick resolution tickets” that a single
operator (NumOperators = 1.31) is able to resolve with a single update (NumUpdates = 1.35).
Besides, these tickets seem to be opened by a person (OpenType near zero) and their resolution
is frequently fast.
Cluster 2 - Multi-Technology Tickets
This cluster presents tickets that require multiple teams (NumOperators = 3.1) to solve. This
happens when a ticket needs to be analysed by different teams, with different backgrounds, and
receives many status updates (NumUpdates = 12.75, by far the highest) until it gets solved.
Some types of tickets need an approval workflow, and therefore many updates in their life
cycle. Because of this, they can have a high complexity index (Complexity near 3).
Cluster 3 - Automatic Tickets
This cluster presents the highest share of automatically opened tickets (OpenType > 0.30).
These tickets are usually “incidents” that require quick updates (NumUpdates = 7.18) due to the
SLA and can pass through several people (NumOperators = 1.74) until they get solved.
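As a hypothetical follow-up, per-cluster profiles like the ones above can be reproduced by attaching each ticket's cluster label and averaging per cluster; the tiny data frame below is a stand-in for the real backlog:

```r
# Three small, well-separated groups of toy tickets
set.seed(123)
tickets <- data.frame(NumOperators = c(1, 1, 3, 3, 2, 2),
                      NumUpdates   = c(1, 2, 12, 13, 7, 8))

km <- kmeans(tickets, centers = 3, nstart = 5)

tickets$cluster <- km$cluster                       # label every ticket
aggregate(. ~ cluster, data = tickets, FUN = mean)  # per-cluster centroids
```

The aggregate output matches km$centers (up to cluster numbering), which is how centroid tables such as the ones interpreted above are obtained.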
Annex
Graph 3 - NumOperators variable distribution and outliers.
Graph 5 - OpenType variable distribution.
Graph 7 - Optimal number of clusters according to the Hubert and D indexes.
# install.packages("NbClust")
# install.packages("corrplot")
# install.packages("gmodels")
# install.packages("factoextra")
# install.packages("cluster")
# install.packages("ggplot2")
library(cluster)
library(NbClust)
library(corrplot)
library(gmodels)
library(factoextra)
library(ggplot2)

# Load the backlog and drop rows with missing values
source_data <- read.csv("C:\\Dados\\Backlog.csv", header = TRUE,
                        sep = ";", dec = ",")
source_data <- na.exclude(source_data)
summary(source_data)
unlist(lapply(source_data, class))  # check each column's type

# Distribution of each variable; dashed lines mark the outlier cut-offs
barplot(table(source_data$NumOperators),
        ylab = "Frequency", xlab = "Number of Operators", col = "cyan")
abline(v = 7.3, lty = 2)

barplot(table(source_data$NumUpdates),
        ylab = "Frequency", xlab = "Number of Updates", col = "cyan")
abline(v = 20.5, lty = 2)

barplot(table(source_data$OpenType),
        ylab = "Frequency", xlab = "Open Type", col = "cyan")

barplot(table(source_data$Complexity),
        ylab = "Frequency", xlab = "Complexity", col = "cyan")
abline(v = 3.6, lty = 2)
abline(v = 1.4, lty = 2)

barplot(table(source_data$Priority),
        ylab = "Frequency", xlab = "Priority", col = "cyan")
abline(v = 2.5, lty = 2)

# Keep only the five retained variables for clustering (assumed column names)
cluster_vars <- source_data[, c("Complexity", "Priority", "NumOperators",
                                "NumUpdates", "OpenType")]

# Optimal number of clusters according to the Hubert and D indexes
k <- NbClust(cluster_vars, distance = "euclidean",
             method = "kmeans", min.nc = 2, max.nc = 5)

# K-means with the three clusters described in the Conclusion
set.seed(123)
km.res <- kmeans(cluster_vars, 3, nstart = 5)
km.res$centers