Defects
Chetan Hireholi
Acknowledgment
I would like to take this opportunity to thank the many eminent personalities without whose
constant encouragement and support this endeavor of mine would not have been successful.
Firstly, I would like to thank PES University for having the Final Year Project
as a part of my curriculum, which gave me a wonderful opportunity to work on my research and
presentation abilities, and provided excellent facilities, without which this project could not
have acquired the orientation it has now.
At the outset I would like to thank Prof. Nitin V. Pujari, Chairperson, PES
University, who shaped my attitude towards the subject of this literary work.
I would also like to thank Dr. Jayashree R., Department of Computer Sci-
ence and Engineering, PES University for her initiative and support which made this work
possible.
Contents
Abstract i
Acknowledgment ii
List of Figures v
1 Introduction 1
1.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Literature Survey 4
2.1 A Probabilistic Model for Software Defect Prediction . . . . . . . . . . . . . 4
2.2 Predicting Bugs from History . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Bibliography 53
List of Figures
1 Predicting Bugs from History- Commonly used complexity metrics . . . . . . 5
2 Life Cycle of a Defect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3 Cleansing in OpenRefine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4 Data Transformation in Microsoft Excel . . . . . . . . . . . . . . . . . . . . 12
5 Distribution of Incident Escalations . . . . . . . . . . . . . . . . . . . . . . . 14
6 Analyzing RED Incidents: Customers vs Escalations . . . . . . . . . . . . . 15
7 Analyzing RED Incidents: Modules vs Escalations . . . . . . . . . . . . . . . 17
8 Analyzing RED Incidents: Software release vs Escalations . . . . . . . . . . 17
9 S/w release vs Escalations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
10 Analyzing RED Incidents: OS vs Escalations . . . . . . . . . . . . . . . . . . 18
11 Analyzing RED Incidents: Developer vs Escalations . . . . . . . . . . . . . . 19
12 Other observations made on Incidents . . . . . . . . . . . . . . . . . . . . . . 20
13 Analyzing CR data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
14 Analyzing CR data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
15 Analyzing CRs: Customers vs Escalations . . . . . . . . . . . . . . . . . . . 22
16 Analyzing CRs: Modules vs Escalations . . . . . . . . . . . . . . . . . . . . . 23
17 Analyzing CRs: S/w release vs Escalations . . . . . . . . . . . . . . . . . . . 23
18 Analyzing CRs: OS vs Escalations . . . . . . . . . . . . . . . . . . . . . . . 24
19 Analyzing CRs: Developer vs Escalations . . . . . . . . . . . . . . . . . . . . 25
20 Classifying using: J48 Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
21 Classifying using: J48 Tree: Prefuse Tree . . . . . . . . . . . . . . . . . . . . 28
22 Module as root node . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
23 Probability for MODULE . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
24 Probability distribution for SEVERITY SHORT . . . . . . . . . . . . . . . 31
25 ESCALATION as the root node . . . . . . . . . . . . . . . . . . . . . . . . . 31
26 Probability distribution table for ESCALATION . . . . . . . . . . . . . . . . 32
27 Probability distribution table for EXPECTATION . . . . . . . . . . . . . . . 33
28 SEVERITY SHORT as the root node . . . . . . . . . . . . . . . . . . . . . . 33
29 Probability distribution for SEVERITY SHORT . . . . . . . . . . . . . . . . 34
30 Probability distribution for Customer Expectations . . . . . . . . . . . . . . 35
31 Final Cluster Centroids . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
32 Model and evaluation on training set . . . . . . . . . . . . . . . . . . . . . . 37
33 Cluster Centroids I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
34 Cluster Centroids II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
35 Total number of Incidents and their Escalation count . . . . . . . . . . . 42
36 Words with highest frequency mined on GREEN tickets escalated to RED . 44
37 GREEN tickets escalated to RED . . . . . . . . . . . . . . . . . . . . . . . . 44
38 Words with highest frequency mined on GREEN tickets escalated to YELLOW 45
39 GREEN ticket escalated to YELLOW . . . . . . . . . . . . . . . . . . . . . . 45
40 Words with highest frequency mined on YELLOW tickets escalated to RED 46
41 YELLOW ticket escalated to RED . . . . . . . . . . . . . . . . . . . . . . . 46
42 Observations made on RED tickets which were Escalated . . . . . . . . . . 47
43 Plotting the highest mined words . . . . . . . . . . . . . . . . . . . . . . . . 47
44 Words with highest frequency mined . . . . . . . . . . . . . . . . . . . . . . 48
45 Plotting the words with highest frequency mined . . . . . . . . . . . . . . . . 48
46 Words with highest frequency mined . . . . . . . . . . . . . . . . . . . . . . 49
47 Plotting the words with highest frequency mined . . . . . . . . . . . . . . . . 49
48 Output of the program . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
1 Introduction
Engineering defects are collected as part of the Quality Assurance (QA) cycle of
a software product or software application.
Product management manages the usage and direction of the product.
1.1 Problem Definition
Determine causes for Defects during the Engineering phase which may lead
to Escalation of Customer Support Cases
While most software defects are corrected and tested as part of the prolonged software
development cycle, enterprise software vendors often have to release software products
before all reported defects are corrected, due to deadlines and limited resources.
A small number of these reported defects will be escalated by customers whose businesses
are seriously impacted. Escalated defects must be resolved immediately and
individually by the software vendors at a very high cost. The total costs can be even
greater, including loss of reputation, satisfaction, loyalty, and repeat revenue.
1.2 Motivation
Surveys have predicted that 80% of IT budgets go towards the maintenance of
applications.
There is a dismal success rate, with only 2% of new projects being successful (funded
from the 20% of the budget allocated to IT).
This research hopes to change the way ticket information (in our opinion a wealth
of information, which has been neglected so far) is looked at.
We mine information that is hidden and lost in the ticket dump. The mined information
will then be useful to deduce important details, which would help the Project
Manager of the team to plan out activities appropriately.
2 Literature Survey
In this process we found many works related to software bug prediction,
which helped us understand the kind of knowledge that can be captured from
bugs. The following is the work carried out in the area of Software Defect Prediction:
2.1 A Probabilistic Model for Software Defect Prediction
Although a number of approaches have been taken to quality prediction for software, none
have achieved widespread applicability. The authors' aim here is to produce a single model
that combines the diverse forms of, often causal, evidence available in software development
in a more natural and efficient way than done previously. The authors use graphical
probability models (also known as Bayesian Belief Networks) as the appropriate formalism for
representing this evidence. They used the subjective judgments of experienced
project managers to build the probability model, and use this model to produce forecasts
about software quality throughout the development life cycle. Moreover, the causal or
influence structure of the model mirrors the real-world sequence of events and relations
more naturally than can be achieved with other formalisms. We used WEKA to apply the
Bayesian Network Classifier. We selected the attributes Escalation, Expectation, Modules
& Severity; then, by rotating the attributes as root nodes, results were captured.
A disadvantage of a reliability model of this complexity is the amount of data needed
to support a statistically significant validation study. A more detailed description of
the application of Bayesian classification is covered in section 4.3. [PMDM]
2.2 Predicting Bugs from History
Version and bug databases contain a wealth of information about software failures: how
the failure occurred, who was affected, and how it was fixed. Such defect information can be
automatically mined from software archives, and it frequently turns out that some modules
are far more defect-prone than others. How do these differences come to be?
The authors have researched how code properties like (a) code complexity, (b) the problem
domain, (c) past history, or (d) process quality affect software defects, and how their
correlation with defects in the past can be used to predict future software properties: where
the defects are, how to fix them, and the associated cost. [PRBD]
Figure 1:
Commonly used complexity metrics
Learning from history means learning from successes and failures, and how to make the
right decisions in the future. In our case, the history of successes and failures is provided
by the bug database: systematic mining uncovers which modules are most prone to defects
and failures. Correlating defects with complexity metrics or the problem domain is useful
in predicting problems for new or evolved components. Learning from history has one big
advantage: one can focus on the aspect of history that is most relevant for the current
situation. Thus the history data provided to us by Hewlett Packard (HP), which consisted of
the incident and change request data, proved helpful during the statistical analysis. More
descriptive coverage of the statistical analysis appears in section 4.2. The dataset given to
us by HP is explained in detail in section 3.
Some more research work carried out on similar problem domains:
Predicting Failures with Hidden Markov Models [PHMM]:
The authors have come up with an approach using Hidden Markov Models (HMMs)
to recognize the patterns in failures. Since HMMs give accurate outputs only with
fewer attributes, we cannot use this approach to recognize the defect patterns in the
defect reports.
Data Mining Based Social Network Analysis from Online Behavior [DSNA]: The
authors used Neural Networks and performed sentiment analysis on social
networks to predict people's online behavior. The approach used in the sentiment
analysis gave me insights into the Natural Language Toolkit (NLTK) and how
NLTK can be used to find the sentiment of user data. This motivated me
to pick up NLTK to analyze the ticket dataset provided by Hewlett Packard.
Detailed information on the application of NLTK is explained in section 4.
3 Exploring the Dataset
This section describes exploring the dataset acquired from Hewlett Packard (HP)
and closely analyzing it. This is the beginning phase of the project, where the data is
understood and meaningful analysis is done. It gives a high-level overview of the datasets.
HP provided two datasets: a Customer Incident dataset and a Change Requests (CR)
dataset.
3.1 Customer Incident Dataset
The Incidents dataset contains the customer cases. These cases include troubleshooting errors,
field issues, installation issues, environment issues and all other cases related to
the software. When a customer logs a unique case which the team identifies as a Change
Request, a corresponding entry is made in the CR dataset.
The life line of each entry is captured and is represented numerically (e.g.
DAYS TO OPEN, DAYS TO FIXED, etc.)
Each entry has an Escalation set by the CPE Support Team on consulting the
customer. The Escalation comes in 3 categories: RED, YELLOW & GREEN.
RED is the highest-priority ticket, YELLOW a potentially important
ticket, and GREEN a ticket with lesser business impact compared to the
RED and YELLOW tickets.
Each entry has an Expectation set by the CPE Support team based on the inputs of
the customer (e.g. Investigate Issue & Hotfix Requested, Answer Question, Create
Enhancement, Investigate Issue, etc.)
The mail communication between the Developer and Customer can be found
under: NOTE CUSTOMER. This field contains all the information about the
Defect being tracked with the team.
Each entry has a Severity set by the CPE Support Team on consulting the
customer. The Severity comes in 3 stages: Low, Medium & High.
Each entry has a date describing when the case was Escalated, called
QUIXY ESCALATED ON DATE & QUIXY ESCALATED YELLOW DATE.
On observing the Incident dataset, we came up with a life cycle of how a defect
is tracked by the team.
The figure below illustrates the life cycle of a defect; this process is currently in use by
the team.
Figure 2:
Life Cycle of a Defect
3.2 Change Requests(CR) Dataset
The CRs dataset contains the Incident Cases which were identified as Change Requests.
The life line of each entry is captured and is represented numerically (e.g.
DAYS TO OPEN, DAYS TO FIXED, etc.)
Each entry has an Escalation set by the CPE Support Team on consulting the
customer. The Escalation comes as Y (Escalated) or N (Not Escalated).
Each entry has an Expectation set by the CPE Support team based on the inputs of the
customer (e.g. Investigate Issue & Hotfix Requested, Answer Question, Create
Enhancement, Investigate Issue, etc.)
The mail communication between the Developer and Customer can be found
under: NOTE CUSTOMER. This field contains all the information about the
Defect being tracked with the team.
Each entry has a Severity set by the CPE Support Team on consulting the
customer. The Severity comes in 3 stages: Low, Medium & High.
Each entry has a date describing when the case was Escalated, called
QUIXY ESCALATED ON DATE.
3.3 Cleaning the Dataset
After receiving the huge dataset, the next step was to clean the data. There were many
discrepancies in the dataset, viz., the presence of non-numeric values in the date fields, and
rows whose data were shifted to the left by 2 to 4 columns; due to this data shift, the data
were not aligned to their headers. The following steps were performed to clean the dataset.
Removing Discrepancies
OpenRefine (formerly Google Refine) is a powerful tool for working with messy data:
cleaning it; transforming it from one format into another; and extending it with web
services and external data.
This tool greatly reduced the cleaning effort. OpenRefine helped explore
the large data sets with ease. It provides functions to transform the data and make it
uniform. E.g., in the Customer column there were different names for a single
company; the tool helps organize the varied names into a single one, thus unifying
the variants Vodafone, Vodafone Inc and vodafone into the single name Vodafone. It
also helps remove special characters to make the text readable, and it takes care
of case sensitivity, i.e., we can edit the contents of multiple rows
using the Text Facet feature, shown in the figure on the next page.
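The kind of name unification the Text Facet supports can be sketched in a few lines of Python. This is an illustrative assumption of the idea, not OpenRefine's actual clustering logic; the normalization rules and suffix list are made up for the example.

```python
import re

def normalize(name):
    # Lower-case, drop punctuation, and strip common company suffixes
    key = re.sub(r"[^a-z ]", "", name.lower())
    key = re.sub(r"\b(inc|ltd|corp)\b", "", key)
    return key.strip()

def unify(names):
    # Map every variant onto the first spelling seen for its normalized key
    canonical = {}
    for n in names:
        canonical.setdefault(normalize(n), n)
    return [canonical[normalize(n)] for n in names]

print(unify(["Vodafone", "Vodafone Inc", "vodafone"]))
# → ['Vodafone', 'Vodafone', 'Vodafone']
```

All three variants collapse onto the first spelling encountered, mirroring how the Text Facet lets one edit many rows at once.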
Figure 3:
Cleansing in OpenRefine
Figure 4:
Data Transformation in Microsoft Excel
Removing Stop Words
The Text Mining (tm) package from the R language helped convert the dataset into
a corpus. The pre-processing of the data is done efficiently by the tm package in
R. The various text transformations offered by the tm package are: removeNumbers,
removePunctuation, removeWords, stemDocument and stripWhitespace.
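The effect of these tm transformations can be mirrored in plain Python; this is a rough equivalent for illustration (the stop-word list here is a tiny illustrative set, not tm's built-in list, and stemming is omitted).

```python
import re

STOPWORDS = {"the", "a", "an", "is", "to", "and"}  # small illustrative list

def preprocess(text):
    text = re.sub(r"\d+", "", text)          # removeNumbers
    text = re.sub(r"[^\w\s]", "", text)      # removePunctuation
    words = text.lower().split()             # split also collapses whitespace
    return [w for w in words if w not in STOPWORDS]  # removeWords (stop words)

print(preprocess("Please send the hotfix for case 1234!"))
# → ['please', 'send', 'hotfix', 'for', 'case']
```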
4 Defect Escalation Analysis
After cleaning the data, the next step was a statistical analysis of the data, which
further unveils some under-the-hood facts.
The initial statistical analysis of the data received from HP (incidents.csv and crs.csv)
was carried out in Microsoft Excel (MS Excel). The Pivot Charts obtained from MS Excel
helped in graphically analyzing the huge datasets.
The incidents.csv contained all the customer incidents reported to the team.
Figure 5:
Distribution of Incident Escalations (incidents.csv)
There were in total 125 RED escalated incidents, 3831 GREEN incidents and 329
YELLOW incidents in the dataset.
Analyzing RED Incidents: Customers vs Escalations
In certain situations the escalation of a case is necessary. For example, a user performing
a task is unable to complete it within a certain period of time. In such
cases, where the user/customer is completely blocked, the case is RED Escalated. These
RED Escalated incident cases are high priority and have to be addressed with
high severity.
The companies RHEINENERGIE, HEWLETT PACKARD and DEUTSCHE BANK had the
highest number of RED Escalations.
Figure 6:
Analyzing RED Incidents: Customers vs Escalations
Company behavior analysis: RHEINENERGIE
RHEINENERGIE had the maximum RED escalations among the customers. There
were around 28 incident cases registered with the Operations Team. The patterns
observed in those 28 incidents are:
Installation (6 cases)
Number of incidents which moved to CRs: 15; 53.57% of the incidents moved to CRs
Analyzing RED Incidents: Modules vs Escalations
The software modules Ops - Action Agent (opcacta) & Installation had the
highest number of RED escalations reported to the Operations Team.
Figure 7:
Analyzing RED Incidents: Modules vs Escalations
Figure 8:
Analyzing RED Incidents: Software release vs Escalations
The incident frequencies of the other software versions are shown below:
Figure 9:
S/w release vs Escalations
Figure 10:
Analyzing RED Incidents: OS vs Escalations
Analyzing RED Incidents: Developer vs Escalations
This describes the developer associated with the incident case. Below is the
distribution of incidents among the developers. The developer who was assigned the
highest number of incident cases is prasad.m.k@hp.com
Figure 11:
Analyzing RED Incidents: Developer vs Escalations
Figure 12:
Other observations made on Incidents
4.1.2 Analyzing Change Requests Dataset
The second dataset provided by HP was the CR data (incident cases which were
Change Requests). These are the cases which are added to the product backlog.
Each entry has an escalation attached to it; the three escalation values are
Showstopper, Yes and No. Below is the distribution of CRs and the nature of
escalation each carried. There were in total 10,387 CR entries. Of these, 10,219 cases did
not escalate, 75 cases escalated and 93 were marked as Showstopper.
Figure 13:
Analyzing CR data
Figure 14:
Analyzing CR data
Analyzing CRs: Customers vs Escalations
The company TATA CONSULTANCY SERVICES LTD. had the maximum
Showstopper escalations, whereas Allegis, NORTHROP GRUMMAN and PepperWeed
are the companies with the highest Y (Yes) escalations.
Figure 15:
Analyzing CRs: Customers vs Escalations
Figure 16:
Analyzing CRs: Modules vs Escalations
Figure 17:
Analyzing CRs: S/w release vs Escalations
Analyzing CRs: OS vs Escalations
The software running on Windows OS had the maximum number of both Showstopper
and Y (Yes) escalations. Note: submitters of these tickets tend to fill the OS
field as they wish; some choose the exact version where the issue was seen or
reported, while others choose just a high level. No strict rules were observed.
Figure 18:
Analyzing CRs: OS vs Escalations
Figure 19:
Analyzing CRs: Developer vs Escalations
After the statistical analysis we used WEKA to apply a few machine learning algorithms
to the dataset. In the next section we use a few data mining concepts along with
machine learning algorithms to draw meaningful conclusions from the dataset.
4.2 Applying Machine Learning on the Dataset
In this phase, classification and clustering are applied to the data to learn the main
attributes responsible for triggering an Escalation. Using the WEKA tool, which offers
various machine learning algorithms to apply to the data set, certain informative conclusions
have been drawn. Classification is a data mining function that assigns items in a collection
to target categories or classes. The goal of classification is to accurately predict the target
class for each case in the data.
Here the target class is the Escalation attribute of each ticket. The class assignments
are known, viz., the Severity of a bug, the Expectation on the defect resolution, the Modules
of the software, etc. By computing these important attributes of a ticket, the classification
algorithm finds the relationship between their values and predicts the value of the target. In
this work, I have chosen the J48 Decision Tree algorithm and the Bayes Network Classifier
algorithm to predict the target class: Escalation.
Attributes selected:
Escalations (Yellow, Red)
Expectation (contains the customer expectation on the resolution of the ticket from
the support team)
Modules
Severity
Results:
Number of Leaves : 5
* The first number is the total number (weight) of instances reaching the leaf.
The second number is the number (weight) of those instances that are misclassified.
Figure 20:
Classifying using: J48 Tree
Figure 21:
Classifying using: J48 Tree: Prefuse Tree
Since the incorrectly classified instances outnumbered the correctly classified
instances, the J48 Decision Tree did not yield the required answers.
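J48 is WEKA's implementation of C4.5, which at each node splits on the attribute that best separates the classes. The underlying information-gain computation can be sketched as follows; the tickets here are hypothetical examples, not rows from the HP dataset (and C4.5 actually uses the gain ratio, a normalized variant of this quantity).

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    # Shannon entropy of a class-label list
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, target):
    # Expected reduction in entropy of `target` after splitting on `attr`
    groups = defaultdict(list)
    for r in rows:
        groups[r[attr]].append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([r[target] for r in rows]) - remainder

tickets = [  # hypothetical tickets, not the HP data
    {"SEVERITY": "High", "MODULE": "opcle",   "ESCALATION": "Red"},
    {"SEVERITY": "High", "MODULE": "opcmona", "ESCALATION": "Red"},
    {"SEVERITY": "Low",  "MODULE": "opcle",   "ESCALATION": "Yellow"},
    {"SEVERITY": "Low",  "MODULE": "opcmona", "ESCALATION": "Yellow"},
]
for a in ("SEVERITY", "MODULE"):
    print(a, info_gain(tickets, a, "ESCALATION"))
# SEVERITY separates the classes perfectly (gain 1.0); MODULE gains nothing
```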
Bayes Network Classifier: A Supervised Learning Approach
Recent work in supervised learning has shown that a surprisingly simple Bayesian classifier
with strong assumptions of independence among features, called naive Bayes, is competitive
with state-of-the-art classifiers such as C4.5. This fact raises the question of whether a
classifier with less restrictive assumptions can perform even better. In that work, the authors
evaluate approaches for inducing classifiers from data, based on the theory of learning Bayesian
networks. These networks are factored representations of probability distributions that
generalize the naive Bayesian classifier and explicitly represent statements about independence.
We used WEKA to apply this Classifier on the Data set.
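The probability tables this classifier derives can be illustrated with a hand-rolled naive Bayes score over hypothetical tickets; Laplace-smoothed counts stand in for WEKA's estimator, and both the data and the smoothing constants are assumptions for the sake of the sketch.

```python
from collections import Counter, defaultdict

# Hypothetical (MODULE, SEVERITY, ESCALATION) tickets, not real HP data.
data = [("opcle", "High", "Red"), ("opcle", "High", "Red"),
        ("opcmona", "Low", "Yellow"), ("opcle", "Low", "Yellow")]

prior = Counter(cls for _, _, cls in data)
cond = defaultdict(Counter)          # cond[(slot, value)][class] = count
for module, sev, cls in data:
    cond[(0, module)][cls] += 1
    cond[(1, sev)][cls] += 1

def score(module, sev, cls):
    # Unnormalised P(cls) * P(module|cls) * P(sev|cls), Laplace-smoothed
    p = prior[cls] / len(data)
    for slot, val in ((0, module), (1, sev)):
        p *= (cond[(slot, val)][cls] + 1) / (prior[cls] + 2)
    return p

scores = {cls: score("opcle", "High", cls) for cls in prior}
print(max(scores, key=scores.get))   # → Red
```

A high-severity opcle ticket scores highest for the Red class, which is the kind of relationship the probability tables in the figures capture.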
By using this classifier, the Probability Distribution is found out:
Attributes selected:
MODULES
SEVERITY SHORT
Figure 22:
Module as root node
Figure 23: Probability for MODULE
The modules Ops - Action Agent (opcacta) & Installation have the highest
number of RED escalations reported to the Operations Team.
Figure 24:
Probability distribution for SEVERITY SHORT
Figure 25:
ESCALATION as the root node
Figure 26:
Probability distribution table for ESCALATION
Figure 27:
Probability distribution table for EXPECTATION
URGENT: 44.90%
HIGH: 44.90%
MEDIUM: 09.00%
LOW: 13.00%
URGENT: 18.10%
HIGH: 63.40%
MEDIUM: 17.20%
LOW: 13.00%
Figure 28:
SEVERITY SHORT as the root node
Probability distribution for SEVERITY SHORT
Figure 29:
Probability distribution for SEVERITY SHORT
URGENT: 44.9% / 55.1%
HIGH: 18.8% / 81.2%
MEDIUM: 14.6% / 85.4%
LOW: 25% / 75%
Probability distribution for Customer Expectations
Figure 30:
Probability distribution for Customer Expectations
4.2.2 Clustering the Incidents Data
Cluster analysis or clustering is the task of grouping a set of objects in such a way that
objects in the same group (called a cluster) are more similar (in some sense or another) to
each other than to those in other groups (clusters).
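WEKA's SimpleKMeans produces the centroids shown in the figures below; the iterative assign-then-recompute loop behind it can be sketched in one dimension. The points and starting centroids here are hypothetical values, not numbers from the HP dataset.

```python
def kmeans(points, centroids, iterations=3):
    for _ in range(iterations):
        # Assign each point to its nearest centroid
        clusters = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its assigned points
        centroids = [sum(ps) / len(ps) for ps in clusters.values() if ps]
    return sorted(centroids)

print(kmeans([1, 2, 3, 40, 42, 44], centroids=[1, 44]))
# → [2.0, 42.0]
```

After a few iterations the centroids settle at the means of the two natural groups, just as the final cluster centroids in the WEKA output settle at representative attribute values.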
Number of iterations: 3
Figure 31:
Final Cluster Centroids
Figure 32:
Model and evaluation on training set
In the below cluster centroid, the instances are divided according to the Escalation
Type of the tickets:
Figure 33:
Cluster Centroids I
In the below cluster centroid, the instances are divided according to the Severity nature
of the tickets:
Figure 34:
Cluster Centroids II
Predictive Apriori - An Apriori variant
The Predictive Apriori algorithm trades larger support for higher confidence, and
calculates the expected accuracy in a Bayesian framework. The result of this algorithm
maximizes the expected accuracy of the association rules on future data.
We used WEKA to apply this algorithm to the incident Dataset.
Below are the findings:
CUSTOMER ENTITLEMENT
SEVERITY SHORT
MODULE
OPERATING SYSTEM
Best rules found:
1. CUSTOMER ENTITLEMENT = Premier & SEVERITY SHORT = Medium (11) ==>
ESCALATION = Yellow (11); Accuracy: 98.84%
2. CUSTOMER ENTITLEMENT = Premier & OS = Linux (11) ==>
ESCALATION = Yellow (11); Accuracy: 98.46%
3. MODULE = Ops - Logfile Encapsulator [opcle] (10) ==>
ESCALATION = Yellow (10); Accuracy: 98.70%
The above rules describe that:
When the module is Ops - Logfile Encapsulator [opcle], there is a 98.70% chance that
the ticket will be Yellow Escalated.
When the Customer Entitlement is Premier and the Severity of the ticket is
Medium, there is a 98.84% chance that it will be Yellow Escalated.
Simple Apriori Algorithm - the Apriori algorithm is an association rule mining
algorithm that was introduced in 1994.
The Apriori algorithm works in several steps. First, the candidate item sets are generated.
Then the database is scanned to check the support of these item sets, which generates
the frequent 1-item sets: in this first scan, item sets with support below the threshold
value are eliminated. In later passes, candidate k-item sets are generated from the
frequent (k-1)-item sets. Iterating the database scan and support counting yields the
support and confidence of each association rule found.
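The frequent-itemset generation just described can be sketched directly; the ticket "transactions" below are hypothetical attribute-value sets, not rows from the HP data, and real implementations add confidence computation and candidate pruning on top of this.

```python
def apriori(transactions, min_support):
    # Frequent 1-item sets: single items above the support threshold
    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]) for i in items
                if sum(i in t for t in transactions) >= min_support}
    result, k = set(frequent), 2
    while frequent:
        # Candidate k-item sets built by joining frequent (k-1)-item sets
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k}
        # Scan the transactions to keep only candidates with enough support
        frequent = {c for c in candidates
                    if sum(c <= t for t in transactions) >= min_support}
        result |= frequent
        k += 1
    return result

tickets = [frozenset(t) for t in (
    {"Premier", "Medium", "Yellow"},
    {"Premier", "Medium", "Yellow"},
    {"Premier", "High", "Red"},
)]
print(sorted(sorted(s) for s in apriori(tickets, min_support=2)))
# the triple {'Medium', 'Premier', 'Yellow'} survives with support 2
```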
Attributes selected:
CUSTOMER ENTITLEMENT
MODULES
SEVERITY SHORT
OS
4.2.3 Text Mining and Natural Language Tool Kit (NLTK)
After evaluating the results acquired in Phase I, a final conclusion could not be drawn:
they did not answer what actually triggers an incident to escalate. This phase
describes the use of text mining and natural language processing to determine the
triggering factor of an incident.
Figure 35:
Total number of Incidents and their Escalation count
The main crux lay in finding out: what made an Incident Ticket get
RED Escalated from other escalation states?
The following is the step-by-step process of finding out what might help to identify
the reason for an Escalation. We took the dataset (Incidents.csv) and performed the
following tasks using R:
Observations made on GREEN tickets which were RED Escalated It was
observed that opcle was the most talked-about module. It can also be observed that the
words please, hotfix & support were used the most in the mail chain exchanged between
the customer and the developer.
Figure 36:
Words with highest frequency mined on GREEN tickets escalated to RED
Figure 37:
GREEN tickets escalated to RED
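The frequency mining behind these figures can be approximated with a plain word counter. The report used R's tm package for this; the snippet below is an illustrative Python equivalent, and both the NOTE CUSTOMER snippets and the stop-word list are made-up examples.

```python
from collections import Counter
import re

STOPWORDS = {"the", "to", "is", "for", "a"}   # tiny illustrative list

def top_words(notes, n=3):
    # Tokenize all notes, drop stop words, and count the rest
    words = re.findall(r"[a-z]+", " ".join(notes).lower())
    return Counter(w for w in words if w not in STOPWORDS).most_common(n)

# Hypothetical NOTE_CUSTOMER snippets, not real ticket text.
notes = ["Please provide the hotfix", "Support team please escalate",
         "Hotfix for opcle please"]
print(top_words(notes))
# → [('please', 3), ('hotfix', 2), ('provide', 1)]
```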
Observations made on GREEN tickets which were YELLOW Escalated It
was observed that support was the most talked-about word. It can also be observed that
the words issue & time were used the most in the mail chain exchanged between the
customer and the developer.
Figure 38:
Words with highest frequency mined on GREEN tickets escalated to YELLOW
Figure 39:
GREEN ticket escalated to YELLOW
Observations made on YELLOW tickets which were RED Escalated This
corpus did not yield anything notable, but it did bring out the name of the
developer most associated with the resolution of the tickets.
Figure 40:
Words with highest frequency mined on YELLOW tickets escalated to RED
Figure 41:
YELLOW ticket escalated to RED
Observations made on RED tickets which were Escalated It was observed that
opcmona was the most talked-about module. It can also be observed that the words
waiting, hotfix & issue were used the most in the mail chain exchanged between the
customer and the developer.
Figure 42:
Observations made on RED tickets which were Escalated
Figure 43:
Plotting the highest mined words
Observations made on the whole set of RED Escalated tickets There were in total
125 RED escalated entries in the Incidents. Text mining on all of the 125 entries
revealed the details below:
Figure 44:
Words with highest frequency mined
The words issue, please, support & escalation were used the most in the mail
chain exchanged between the Customer and the Team.
Figure 45:
Plotting the words with highest frequency mined
Observations made on the whole set of GREEN Escalated tickets There were in total
3831 GREEN escalated entries in the Incidents. Text mining on all of these entries
revealed the details below:
Figure 46:
Words with highest frequency mined
Mining the whole GREEN Escalated tickets did not yield valuable information.
Figure 47:
Plotting the words with highest frequency mined
The above observations show that mail chains in which the key words please,
hotfix & support occur frequently are the most likely to convert to a RED Escalation.
We then used these key words to build a program which takes the Incident
data dump as input and scans the email chain. As the frequency of these key words
increases, the program alerts the user when it crosses the threshold limit. The threshold
limit can be adjusted by the developer based on the ongoing trend.
Figure 48:
Output of the program
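The alerting idea described above can be sketched in a few lines. This is not the actual program whose output Figure 48 shows; the key-word list comes from the mining results, while the threshold value and the ratio-based scoring are illustrative assumptions.

```python
# Key words taken from the mining results; threshold is an assumed value
# that a developer would tune to the ongoing trend.
KEY_WORDS = {"please", "hotfix", "support", "escalation"}

def alert(mail_chain, threshold=0.15):
    # Flag the ticket when the share of key words crosses the threshold
    words = mail_chain.lower().split()
    hits = sum(w.strip(".,!?") in KEY_WORDS for w in words)
    return hits / len(words) >= threshold

print(alert("Please send the hotfix, support is blocked"))   # → True
print(alert("Question about configuration syntax"))          # → False
```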
5 Results and Conclusions
By text mining and applying machine learning algorithms on the incident dataset, we ob-
tained the below results:
The mail chain of a ticket which is going to be escalated to RED will contain these
words in high occurrence: please, hotfix, support.
The software modules Ops - Logfile Encapsulator (opcle), Ops - Action Agent (opcacta)
& Installation had the highest number of RED escalations reported to the team
as incidents, whereas Ops - Monitor Agent (opcmona) & Installation had the highest
Showstopper escalations for change requests.
By applying the Predictive Apriori algorithm on the Incident dataset, we observed the
following:
We got a confidence of 98.70% when the module reported was opcle and the
Escalation of the case was Yellow.
For an engineering team it is really important to avoid any major RED escalations. The
team gets a lot of incidents which need to be resolved in a limited time. Since the number of
incidents is large, it becomes hard for the team to keep track of all the incident issues with
respect to the criticality and the severity of each incident. By implementing such predictive
mechanisms, an incident which will turn RED can be flagged to the team. This would
help the manager allocate appropriate resources based on the criticality of the incoming
tickets, help resolve the incident tickets within the stipulated time, and avoid any unwanted
escalations. This would indeed help in maintaining the trust of the customer as well.
More accurate and varied results could have been achieved, but the missing data in the
dataset limited us. Due to the discrepancies in the data (data being shifted by 3 to 4
columns), we had to ignore such inconsistent entries in the dataset when performing the
statistical analysis.
The use of NLTK proved to be very helpful in extracting meaningful conclusions
from the tickets dataset. NLTK can be used to analyze the real-time behavior of the tickets
coming in to the team. This analysis can be used to provide proactive resolutions to the
customers, thus preventing the tickets from getting escalated.
References
[BAG] Leo Breiman, Bagging Predictors, Machine Learning, 1996
[SFTWR] Ishani Arora, Vivek Tetarwal, Anju Saha, Software Defect Prediction
[SFRC] Stamatia Bibi, Grigorios Tsoumakas, I. Vlahavas, Ioannis Stamelos, Prediction Using
Regression via Classification
[IEMD] Ian H. Witten, Eibe Frank, Mark A. Hall, Data Mining: Practical Machine Learning
Tools and Techniques, Third Edition, Elsevier Inc., 2011
[EFRB] Eibe Frank,Computer Science Department, University of Waikato, New Zealand and
Remco R. Bouckaert,Xtal Mountain Information Technology, Auckland, New Zealand,
Naive Bayes for Text Classification with Unbalanced Classes
[JMJ11] Jiawei Han, Micheline Kamber, Jian Pei, Data Mining: Concepts and Techniques,
Third Edition, 2011
[IT2008] Irina Tudor, Association Rule Mining as a Data Mining Technique, 2008.
[PMDM] Norman Fenton, Paul Krause and Martin Neil, A Probabilistic Model for Software
Defect Prediction, 2006
[BLMK] Billy Edward Hunt, Jr., Overland Park, KS (US); Jennifer J. Kirkpatrick, Olathe, KS
(US); Richard Allan Kloss, et al., Software Defect Prediction, 2014
[PEFS] Cathrin Weiss, Rahul Premraj, Thomas Zimmermann, Andreas Zeller, Predicting Effort
to Fix Software Bugs, 2006
[CSDE] Victor S. Sheng, Bin Gu, Wei Fang, Jian Wu, Cost-Sensitive Learning for Defect
Escalation, 2001
[DSNA] Jaideep Srivastava, Muhammad A. Ahmad, Nishith Pathak, David Kuo-Wei Hsu, Data
Mining Based Social Network Analysis from Online Behavior, 2008
[PHMM] Felix Salfner, Predicting Failures with Hidden Markov Models, 2005