
Master in Advanced Analytics

2017/2018

Data Mining
Project
Group Elements

Student ID Student Name

M20170606 Biazi Bayer

M20170590 Rodrigo Pupo Ribeiro

M20170366 Vasco Castela

Table of Contents
Group Elements

Executive Summary

Introduction

Company

Tools and Methodology

Variable Analysis
Complexity Variable
Priority Variable
NumOperators Variable
NumUpdates Variable
OpenType Variable

Data Preparation
PCA
Pre-Processing
Duplicate Data
Outliers
Missing Values
Variables Selection

Segmentation

Conclusion
Clusters
Cluster 1 - Quick Resolutions Tickets
Cluster 2 - Multi-Technology Tickets
Cluster 3 - Automatic Tickets

Annex

Executive Summary
All backlog information about the tickets opened across the clients' infrastructures needs to
be analysed in order to understand the behaviour of these requests, decide how to make the
current process more efficient, and better understand the customers' needs.

Introduction
Service Manager is the application used to manage all opened tickets. The four ticket
categories are listed below.

1 - Change: This category implies the modification of a Configuration Item (CI) and requires
approval from the customer;
2 - Task (or Change Task): Used when many changes have to be carried out, by different
teams, before the main goal is achieved;
3 - Incident: A ticket created due to an interruption of service or some kind of delay which
could severely degrade the quality of service of a CI. In general it is detected automatically and
an agent opens the ticket. It has high priority due to the SLA;
4 - Problem: This type is created after previous incidents have occurred without a
consistent resolution being provided. It has a Moderate to High priority.

Company
XPTO IT Consulting is a global company in the IT sector with a branch in Portugal. It has a
hundred employees, 40 large companies as customers and an annual revenue of around
2 million dollars. It is a real company with real data but, due to an NDA, its name was
changed to avoid legal problems.

Tools and Methodology
Analysing the data provided by XPTO IT Consulting requires a lot of effort when unsuitable
software, such as Excel, is used. Although the data was delivered as a CSV file, Excel could not
support an appropriate analysis. We therefore turned to advanced analysis tools such as SAS,
Python or R. Since we had already used SAS in a previous project, we decided to try R this
time. We used R (version 3.4.3) and RStudio (1.1.383).

After data cleansing, the analysis was carried out on five main variables: Complexity, Priority,
Number of Operators, Number of Updates and Open Type.

We used the K-Means method, with the Elbow graph to determine the number of clusters.

Variable Analysis

Complexity Variable
This variable defines how complex the ticket task is. Its values are integers. See graph 1 in
the Annex section for its distribution.

Min.    1st Quartile    Median    Mean     3rd Quartile    Max.
0.000   3.000           3.000     2.587    4.000           4.000

Priority Variable
This variable defines how fast the ticket should be resolved. Its values are integers. See
graph 2 in the Annex section for its distribution.

Min.    1st Quartile    Median    Mean     3rd Quartile    Max.
0.000   3.000           4.000     3.536    0.000           1.000

NumOperators Variable
This variable shows the number of operators who have worked on each ticket. Its values are
integers. See graph 3 in the Annex section for its distribution.

Min.    1st Quartile    Median    Mean     3rd Quartile    Max.
1.000   1.000           2.000     1.782    2.000           16.000

NumUpdates Variable
This variable shows the number of updates made to each ticket. Its values are integers. See
graph 4 in the Annex section for its distribution.

Min.    1st Quartile    Median    Mean     3rd Quartile    Max.
1.000   6.000           7.000     6.465    8.000           29.000

OpenType Variable
This variable shows how a ticket was opened: type 0 was opened by a person and type 1 was
opened automatically by a monitoring agent running on the servers. Its values are integers.
See graph 5 in the Annex section for its distribution.

Min.    1st Quartile    Median    Mean      3rd Quartile    Max.
0.000   0.000           0.000     0.2511    1.000           1.000
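
Because OpenType only takes the values 0 and 1, its mean can be read as the share of tickets opened automatically (a mean of 0.2511 corresponds to about 25% of tickets). The sketch below illustrates this on a made-up vector; `open_type` and its values are assumptions for illustration, not the project data.

```r
# Made-up 0/1 vector: 1 = opened by a monitoring agent, 0 = opened by a person.
open_type <- c(0, 0, 1, 0, 1, 0, 0, 0)

# Same six statistics as in the tables of this section.
summary(open_type)

# For a 0/1 variable the mean is the fraction of ones.
share_automatic <- mean(open_type)
share_automatic
```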

Data Preparation

PCA

PCA (Principal Component Analysis), a technique for feature selection and dimensionality
reduction, was not used because the provided data contains only a small number of useful
variables.

Pre-Processing

Duplicate Data
After an initial analysis using Microsoft SQL Server, all duplicate records were deleted.
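
The project removed duplicates in SQL Server. For comparison, the same step could be sketched in base R as follows; the data frame `tickets` and its values are made up for illustration.

```r
# Toy data standing in for the backlog export (values are made up).
tickets <- data.frame(
  NumUpdates   = c(6, 6, 7, 12),
  NumOperators = c(1, 1, 2, 3),
  OpenType     = c(0, 0, 1, 0)
)

# duplicated() flags every row that exactly repeats an earlier row.
n_dupes <- sum(duplicated(tickets))

# Keep only the first occurrence of each distinct row.
tickets_unique <- tickets[!duplicated(tickets), ]

n_dupes
nrow(tickets_unique)
```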

Outliers

Complexity variable - The data is mainly distributed across values 0, 3 and 4, so values 1 and 2
were considered outliers and removed from the dataset in Excel.

Priority variable - The data is mainly distributed across values 2, 3 and 4, so values 0 and 1
were considered outliers and removed from the dataset in Excel.

NumOperators variable - Records with seven or more operators were considered outliers and
removed from the dataset in Excel.

NumUpdates variable - Records with eighteen or more updates were considered outliers and
removed from the dataset in Excel.

OpenType variable - This variable has no outliers.
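
The filtering was done in Excel. As an illustration only, the four rules above could be expressed in R roughly as follows. The column names follow the report, while the rows of `tickets` are made up.

```r
# Toy data; column names follow the report, values are made up.
tickets <- data.frame(
  Complexity   = c(0, 1, 3, 4, 3, 2),
  Priority     = c(4, 3, 1, 2, 3, 4),
  NumOperators = c(1, 2, 2, 6, 3, 1),
  NumUpdates   = c(5, 6, 7, 8, 18, 9)
)

# A row is kept only if it passes all four rules.
keep <- tickets$Complexity %in% c(0, 3, 4) &   # drop Complexity 1 and 2
  tickets$Priority %in% c(2, 3, 4) &           # drop Priority 0 and 1
  tickets$NumOperators < 7 &                   # drop seven or more operators
  tickets$NumUpdates < 18                      # drop eighteen or more updates

tickets_clean <- tickets[keep, ]
tickets_clean
```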

Missing Values

All rows with missing values were removed in RStudio. See script 4 in the Annex section.

Variables Selection

Variables Excluded

Some variables were excluded because they do not contribute to the cluster analysis, e.g.
date-related variables such as “Duration in Days” and “Days without Update”, and a boolean
variable called “CustomVisible”.

Correlated Variables

According to the Pearson coefficient (table 1, in the Annex section), there are no redundant
variables, since no pair reaches a correlation of 0.89. Thus, we decided to keep all five variables.
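
A redundancy check of this kind could be sketched as below: compute the Pearson correlation matrix and list the pairs at or above the 0.89 cut-off. The data frame `toy` is synthetic, built so that exactly one pair is strongly correlated, and the use of the absolute value is an assumption.

```r
set.seed(1)  # reproducible toy data
n <- 200
x <- rnorm(n)
toy <- data.frame(
  a = x,
  b = x + rnorm(n, sd = 0.1),  # nearly a copy of a, so highly correlated
  c = rnorm(n)                 # generated independently of a and b
)

threshold <- 0.89              # cut-off quoted in the report
M <- cor(toy, method = "pearson")

# Look only at the upper triangle so each pair is listed once.
high <- which(abs(M) >= threshold & upper.tri(M), arr.ind = TRUE)
redundant_pairs <- data.frame(
  var1 = rownames(M)[high[, "row"]],
  var2 = colnames(M)[high[, "col"]]
)
redundant_pairs
```

An empty `redundant_pairs` would correspond to the situation described above, where no variable needs to be dropped.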

Segmentation
Identifying business behaviours and establishing a well-defined administration strategy
requires prior knowledge of the existing customers' ticket profiles. A main segmentation was
therefore implemented, based on the customer ticket life cycle. Tickets are thus grouped into
different clusters, each reflecting, as far as possible, the characteristics that its tickets have
in common.

The clustering method used was K-Means, chosen for its generally better performance
compared with Hierarchical Clustering. K-Means repeatedly recalculates the centroids until
they stabilise, while Hierarchical Clustering performs a series of steps in which elements are
grouped or split according to their characteristics, with the result presented as a
dendrogram.
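
The two approaches can be tried side by side on a small synthetic example. This is only a sketch: the data in `toy` is made up, and `hclust()` with Ward linkage is one agglomerative (merging) variant of hierarchical clustering, chosen here as an assumption.

```r
set.seed(42)
# Two well-separated toy groups in two dimensions (made-up data).
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2)
)

# K-Means: iteratively reassigns points and recomputes the centroids.
km <- kmeans(toy, centers = 2, nstart = 10)

# Hierarchical clustering: builds a merge tree from pairwise distances,
# which is then cut to obtain the requested number of groups.
hc <- hclust(dist(toy), method = "ward.D2")
hc_groups <- cutree(hc, k = 2)

# Cross-tabulate the two labelings; plot(hc) would draw the dendrogram.
agreement <- table(kmeans = km$cluster, hierarchical = hc_groups)
agreement
```

Note that `dist()` stores all pairwise distances, which is one reason hierarchical clustering becomes expensive on large datasets.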

Furthermore, the number of clusters, which in a strictly descriptive analysis should balance
complexity against descriptive/discriminating ability, was determined using the Elbow graph
rule, whose goal is to find the minimum number of clusters needed for the segmentation
dataset.
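
To make the rule concrete, the sketch below computes the total within-cluster sum of squares for several values of k on synthetic data with three groups, and then the reduction gained by each extra cluster. The data and the range `1:6` are assumptions for illustration.

```r
set.seed(7)
# Made-up data with three well-separated groups.
toy <- rbind(
  matrix(rnorm(60, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 10, sd = 0.5), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..6.
ks <- 1:6
wss <- sapply(ks, function(k) kmeans(toy, centers = k, nstart = 20)$tot.withinss)

# Reduction gained by each additional cluster; the elbow is where it flattens.
gain <- -diff(wss)
data.frame(k = ks[-1], wss_reduction = round(gain, 1))
```

On data like this the reduction is large up to k = 3 and small afterwards, which is the "elbow" the rule looks for.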

The Hubert and “D” indexes were used to confirm the choice made with the Elbow graph. (See
graph 7 in the Annex section.)

Conclusion

Clusters

With the number of clusters set to three, we obtained the following centroids:

#    NumUpdates    NumOperators    OpenType      Complexity    Priority
1    1.35013       1.312356        0.09506494    2.868015      3.750464
2    12.75628      3.103037        0.11065054    2.948820      3.707047
3    7.18822       1.747730        0.31791414    2.451834      3.447654
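
As a worked illustration of how these centroids can be used, the sketch below assigns a ticket to the cluster whose centroid is nearest in Euclidean distance. The centroid values are copied from the table above; the `ticket` vector is hypothetical.

```r
# Centroids copied from the table above (rows = clusters 1 to 3).
centroids <- matrix(
  c( 1.35013, 1.312356, 0.09506494, 2.868015, 3.750464,
    12.75628, 3.103037, 0.11065054, 2.948820, 3.707047,
     7.18822, 1.747730, 0.31791414, 2.451834, 3.447654),
  nrow = 3, byrow = TRUE,
  dimnames = list(1:3, c("NumUpdates", "NumOperators", "OpenType",
                         "Complexity", "Priority"))
)

# Hypothetical new ticket, in the same column order as the centroids.
ticket <- c(NumUpdates = 11, NumOperators = 3, OpenType = 0,
            Complexity = 3, Priority = 4)

# Euclidean distance from the ticket to each centroid.
distances <- sqrt(rowSums(sweep(centroids, 2, ticket)^2))

# Index of the nearest centroid, i.e. the cluster the ticket would join.
nearest <- which.min(distances)
distances
```

With many updates and three operators, this hypothetical ticket lies closest to the centroid of cluster 2.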

Cluster 1 - Quick Resolutions Tickets

This cluster contains tickets with few interactions and few updates. These characteristics
lead us to conclude that these are “quick resolution tickets”, which a single operator
(NumOperators = 1.31) is typically able to resolve with a single update (NumUpdates = 1.35).
In addition, they are mostly opened manually (OpenType near zero, where 0 means opened by
a person), and their resolution is frequently fast.

Cluster 2 - Multi-Technology Tickets

This cluster contains tickets that require multiple teams (NumOperators = 3.1) to be solved.
This happens when a ticket needs to be analysed by different teams with different
backgrounds, and goes through many status updates (NumUpdates = 12.75, by far the highest)
until it is solved. Some types of tickets need an approval workflow and therefore many
updates during their life cycle. Because of this they can have a high complexity index
(Complexity near 3).

Cluster 3 - Automatic Tickets

This cluster has the highest share of automatically opened tickets (OpenType > 0.30). These
tickets are usually “incidents” that require quick updates (NumUpdates = 7.18) due to the SLA
and can pass through several people (NumOperators = 1.74) until they are solved.

Annex

Graph 1 - Complexity variable distribution and outliers.

Graph 2 - Priority variable distribution and outliers.

Graph 3 - NumOperators variable distribution and outliers.

Graph 4 - NumUpdates variable distribution and outliers.

Graph 5 - OpenType variable distribution.

Graph 6 - Elbow Graph.

Graph 7 - Optimal number of clusters according to the Hubert and D indexes.

Table 1 - Correlation table.

#install.packages("NbClust")
#install.packages("corrplot")
#install.packages("gmodels")
#install.packages("factoextra")
#install.packages("cluster")
#install.packages("ggplot2")

script 1 - Install required packages

library(cluster)

library(NbClust)
library(corrplot)
library(gmodels)
library(factoextra)
library(ggplot2)

script 2 - Use packages

source_data=read.csv("C:\\Dados\\Backlog.csv",header = TRUE,
sep = ";", dec=",")

script 3 - Import data to R

source_data=na.exclude(source_data)

script 4 - Remove missing values

summary(source_data)

script 5 - Summary of data

unlist(lapply(source_data, class))

script 6 - Columns metadata

barplot(table(source_data$NumOperators),
ylab="Frequency", xlab="Number of Operators",col="cyan")
abline(v = 7.3, lty =2)

script 7 - Plot NumOperators histogram

barplot(table(source_data$NumUpdates),
ylab="Frequency", xlab="Number of Updates",col="cyan")
abline(v = 20.5, lty =2)

script 8 - Plot NumUpdates histogram

# OpenType has no outliers, so no cut-off line is drawn on this plot
barplot(table(source_data$OpenType),
ylab="Frequency", xlab="Open Type",col="cyan")

script 9 - Plot OpenType histogram

barplot(table(source_data$Complexity),
ylab="Frequency", xlab="Complexity",col="cyan")
abline(v = 3.6, lty =2)

abline(v = 1.4, lty =2)

script 10 - Plot Complexity histogram

barplot(table(source_data$Priority),
ylab="Frequency", xlab="Priority",col="cyan")
abline(v = 2.5, lty =2)

script 11 - Plot Priority histogram

myData <- as.data.frame(lapply(source_data, as.numeric))

# Pearson correlation matrix, computed once and reused for the plot
M <- cor(myData)
head(round(M, 2))
corrplot(M, method = "number")

script 12 - Correlation table

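myData <- as.data.frame(lapply(source_data, as.numeric))

set.seed(123)  # make the random K-Means starts reproducible
wss <- numeric(5)
for (i in 1:5) {
  # total within-cluster sum of squares for i clusters
  wss[i] <- sum(kmeans(myData, centers = i)$withinss)
}
plot(1:5, wss, type = "b", xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")
abline(v = 3, lty = 2)  # chosen number of clusters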

script 13 - Elbow Graph

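# data_sample is not defined in these scripts; it is assumed to be a subset of the rows of myData
k <- NbClust(data_sample, distance = "euclidean",
             method = "kmeans", min.nc = 2, max.nc = 5)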

script 14 - Hubert and “D” indexes (enforce Elbow Graph).

set.seed(123)
# three clusters, as chosen with the Elbow graph and reported in the centroid table
km.res <- kmeans(myData, 3, nstart = 5)

script 15 - K-Means group members observation.

fviz_cluster(km.res, data = myData, geom = "point", stand = FALSE)

script 16 - K-Means cluster graph

km.res$centers

script 17 - Centroids information
