2017/2018
Data Mining
Project
Group Elements
Table of Contents
Group Elements
Executive Summary
Introduction
Company
Tools and Methodology
Variable Analysis
Complexity Variable
Priority Variable
NumOperators Variable
NumUpdates Variable
OpenType Variable
Data Preparation
PCA
Pre-Processing
Duplicate Data
Outliers
Missing Values
Variables Selection
Segmentation
Conclusion
Clusters
Cluster 1 - Quick Resolutions Tickets
Cluster 2 - Multi-Technology Tickets
Cluster 3 - Automatic Tickets
Annex
Executive Summary
We were asked to analyse all the backlog information about tickets opened across the clients'
infrastructures, in order to understand the behaviour of these requests and then decide how to
improve the current process, making it more efficient and better matched to the customers' needs.
Introduction
The Service Manager is an application used to manage all opened tickets. The four categories
of tickets are listed below.
1 - Change: this category implies a modification of a Configuration Item (CI) and requires
approval from the customers;
2 - Task (or Change Task): as its name suggests, it is used when several changes, by different
teams, have to be done before the main goal is reached;
3 - Incident: a ticket created due to an interruption of service or some kind of delay which
could lead to a very poor quality of service of a CI. In general it is detected automatically and an
agent opens a ticket. It has high priority due to the SLA;
4 - Problem: this type is created after previous incidents have been triggered without a
consistent resolution being provided. It has a Moderate to High priority.
Company
The company, XPTO IT Consulting, is a global company in the IT area with a branch in
Portugal. It has a hundred employees, 40 large companies as customers and an annual revenue
of around 2 million dollars. It is a real company with real data but, due to an N.D.A., its name
was changed to avoid legal problems.
Tools and Methodology
Analysing the data provided by XPTO IT Consulting requires a lot of effort when an unsuitable
tool, such as Excel, is used. Although the data was delivered in a CSV file, Excel could not
support an appropriate analysis. Because of this, we were pushed towards advanced analysis
tools such as SAS, Python or R. Since we had already used SAS in a previous project, we
decided to try R this time. So, we used R (version 3.4.3) and RStudio (1.1.383).
After data cleansing, the analysis was based on five main variables: Complexity, Priority,
Number of Operators, Number of Updates and Open Type.
We used the K-means strategy and the Elbow graph in order to determine the number of clusters.
Variable Analysis
Complexity Variable
This variable defines how complex the ticket task is. Its value is of integer type.
See graph 1 in the Annex section to understand its distribution.
Priority Variable
This variable defines how fast the ticket should be resolved. Its value is of integer type.
See graph 2 in the Annex section to understand its distribution.
NumOperators Variable
This variable shows the number of operators who have worked on each ticket. Its value is of
integer type. See graph 3 in the Annex section to understand its distribution.
NumUpdates Variable
This variable shows the number of updates made on each ticket. Its value is of integer type.
See graph 4 in the Annex section to understand its distribution.
OpenType Variable
This variable shows how a ticket was opened: type 0 was opened by a person and type 1 was
opened automatically by a monitoring agent running on the servers. Its value is of integer
type. See graph 5 in the Annex section to understand its distribution.
Data Preparation
PCA
PCA (Principal Component Analysis) was not used, due to the small number of useful variables
in the provided data; no further feature selection or dimensionality reduction was needed.
Pre-Processing
Duplicate Data
After an initial analysis using Microsoft SQL Server, all duplicate records were deleted.
Outliers
Complexity variable - The data is concentrated mainly in types 0, 3 and 4, so types 1 and 2
were considered outliers and removed from the dataset in Excel.
Priority variable - The data is concentrated mainly in types 2, 3 and 4, so types 0 and 1 were
considered outliers and removed from the dataset in Excel.
NumOperators variable - Records with seven (or more) operators were considered outliers and
removed from the dataset in Excel.
NumUpdates variable - Records with eighteen (or more) updates were considered outliers and
removed from the dataset in Excel.
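The same filters can be reproduced in R instead of Excel. The sketch below assumes a data frame named source_data with the four columns discussed above; the values are toy stand-ins for the real backlog:

```r
# Toy stand-in for the real backlog (same column names as in this report)
source_data <- data.frame(
  Complexity   = c(0, 3, 4, 1, 2, 3),
  Priority     = c(2, 3, 4, 0, 1, 3),
  NumOperators = c(1, 2, 7, 3, 8, 1),
  NumUpdates   = c(1, 5, 18, 2, 30, 4)
)

# Apply all four outlier rules at once
clean <- subset(source_data,
                Complexity   %in% c(0, 3, 4) &  # drop Complexity types 1 and 2
                Priority     %in% c(2, 3, 4) &  # drop Priority types 0 and 1
                NumOperators < 7 &              # drop seven or more operators
                NumUpdates   < 18)              # drop eighteen or more updates

nrow(clean)  # rows surviving the filters: 3
```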
Missing Values
All missing values were removed in R Studio. See script 4 in Annex section.
Variables Selection
Variables Excluded
Some variables were excluded because they do not affect the cluster analysis, e.g. some
date-related ones such as “Duration in Days” and “Days without Update”, and a boolean variable
called “CustomVisible”.
Correlated Variables
According to the Pearson coefficients (table 1, in the Annex section), there are no redundant
variables, since no pair reaches 0.89. Thus, we decided to keep those five variables.
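This redundancy check can be sketched in R with cor(); the variable names come from this report, while the values below are toy stand-ins for the real backlog:

```r
# Toy stand-in for the five retained variables
vars <- data.frame(
  Complexity   = c(0, 3, 4, 3, 0, 4),
  Priority     = c(2, 4, 3, 2, 4, 3),
  NumOperators = c(1, 3, 2, 1, 2, 3),
  NumUpdates   = c(1, 5, 12, 3, 6, 9),
  OpenType     = c(0, 1, 0, 1, 0, 1)
)

r <- cor(vars, method = "pearson")  # 5 x 5 Pearson correlation matrix
off_diag <- r[upper.tri(r)]         # the ten pairwise coefficients
any(abs(off_diag) >= 0.89)          # TRUE would flag a redundant pair
```

With the real data, a TRUE here would justify dropping one variable of each flagged pair before clustering.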
Segmentation
In order to identify business behaviours and establish a well-defined administration strategy
accordingly, prior knowledge of the existing customers' ticket profiles is needed. Therefore, a
main segmentation was implemented, based on the customer ticket life cycle. Thus, tickets were
grouped into different clusters, each of them reflecting, as much as possible, common ticket
characteristics under the defined segmentation.
The clustering method used was K-means, due to its generally better performance when
compared to the Hierarchical Clustering technique. The K-means method repeats cycles of
centroid recalculation until the best centroids are found, while Hierarchical Clustering uses a
series of divisions in which elements are grouped and ungrouped according to their
characteristics and, in the end, the result is presented as a dendrogram.
Furthermore, the best number of clusters to be considered, which should balance complexity
against descriptive/discriminative ability in a strictly descriptive analysis, was determined
using the Elbow graph rule, whose goal is to find the minimum number of necessary clusters
within each segmentation dataset.
The Hubert and D indexes were used in order to confirm the Elbow graph choice. (See graph
7 in the Annex section.)
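The Elbow rule itself can be sketched in a few lines of R: run kmeans for increasing k and plot the total within-cluster sum of squares; the "elbow" is where the curve stops dropping sharply. Toy two-cluster data stands in for the real backlog here:

```r
# Two well-separated toy clusters in 2-D
set.seed(123)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 5), ncol = 2))

# Total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)

plot(1:6, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster SS")  # the bend marks the elbow
```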
Conclusion
Clusters
After calculating the defined number of clusters, we obtained the following centroids:
Cluster 1 - Quick Resolutions Tickets
This cluster presents tickets with small numbers of interactions and small numbers of updates.
These characteristics lead us to conclude that these are “quick resolution tickets” that a single
operator (NumOperators = 1.31) is able to resolve with a single update (NumUpdates = 1.35).
Besides, these tickets seem to be opened by a person (OpenType near zero) and their resolution
is frequently fast.
Cluster 2 - Multi-Technology Tickets
This cluster presents tickets that require multiple teams (NumOperators = 3.1) to solve. This
happens when a ticket needs to be analysed by different teams, with different backgrounds, and
receives many status updates (NumUpdates = 12.75, by far the highest) until it gets solved.
Some types of tickets need an approval workflow, and therefore many updates in their life
cycle. Because of this, they can have a high complexity index (Complexity near 3).
Cluster 3 - Automatic Tickets
This cluster presents the highest share of automatically opened tickets (OpenType > 0.30).
These tickets are usually “incidents” that require quick updates (NumUpdates = 7.18) due to the
SLA and can pass through several people (NumOperators = 1.74) until they get solved.
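As a hypothetical follow-up, per-cluster profiles like the ones above can be reproduced by attaching each ticket's cluster label and averaging per cluster; the tiny data frame below is a stand-in for the real backlog:

```r
# Three small, well-separated groups of toy tickets
set.seed(123)
tickets <- data.frame(NumOperators = c(1, 1, 3, 3, 2, 2),
                      NumUpdates   = c(1, 2, 12, 13, 7, 8))

km <- kmeans(tickets, centers = 3, nstart = 5)

tickets$cluster <- km$cluster                       # label every ticket
aggregate(. ~ cluster, data = tickets, FUN = mean)  # per-cluster centroids
```

The aggregate output matches km$centers (up to cluster numbering), which is how centroid tables such as the ones interpreted above are obtained.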
Annex
Graph 3 - NumOperators variable distribution and outliers.
Graph 5 - OpenType variable distribution.
Graph 7 - Optimal number of clusters according to the Hubert and D indexes.
# install.packages("NbClust")
# install.packages("corrplot")
# install.packages("gmodels")
# install.packages("factoextra")
# install.packages("cluster")
# install.packages("ggplot2")
library(cluster)
library(NbClust)
library(corrplot)
library(gmodels)
library(factoextra)
library(ggplot2)

# Load the backlog and drop rows with missing values
source_data <- read.csv("C:\\Dados\\Backlog.csv", header = TRUE,
                        sep = ";", dec = ",")
source_data <- na.exclude(source_data)
summary(source_data)
unlist(lapply(source_data, class))  # check each column's type

# Distribution of each variable; dashed lines mark the outlier cut-offs
barplot(table(source_data$NumOperators),
        ylab = "Frequency", xlab = "Number of Operators", col = "cyan")
abline(v = 7.3, lty = 2)

barplot(table(source_data$NumUpdates),
        ylab = "Frequency", xlab = "Number of Updates", col = "cyan")
abline(v = 20.5, lty = 2)

barplot(table(source_data$OpenType),
        ylab = "Frequency", xlab = "Open Type", col = "cyan")

barplot(table(source_data$Complexity),
        ylab = "Frequency", xlab = "Complexity", col = "cyan")
abline(v = 3.6, lty = 2)
abline(v = 1.4, lty = 2)

barplot(table(source_data$Priority),
        ylab = "Frequency", xlab = "Priority", col = "cyan")
abline(v = 2.5, lty = 2)

# Keep only the five retained variables for clustering (assumed column names)
cluster_vars <- source_data[, c("Complexity", "Priority", "NumOperators",
                                "NumUpdates", "OpenType")]

# Optimal number of clusters according to the Hubert and D indexes
k <- NbClust(cluster_vars, distance = "euclidean",
             method = "kmeans", min.nc = 2, max.nc = 5)

# K-means with the three clusters described in the Conclusion
set.seed(123)
km.res <- kmeans(cluster_vars, 3, nstart = 5)
km.res$centers