
Predicting Cross-Site Scripting (XSS) Security Vulnerabilities in Web Applications


Mukesh Kumar Gupta, Mahesh Chandra Govil, Girdhari Singh

Department of Computer Science & Engineering


Malviya National Institute of Technology, Jaipur-302017, Rajasthan, INDIA
Email: mukeshgupta@skit.ac.in, govilmc@yahoo.com, girdharisingh@rediffmail.com
Abstract: Recently, machine-learning-based vulnerability prediction models have been gaining popularity in the web security space, as these models provide a simple and efficient way to handle web application security issues. Existing state-of-the-art Cross-Site Scripting (XSS) vulnerability prediction approaches do not consider the context of the user input in the output statement, which is very important for identifying context-sensitive security vulnerabilities. In this paper, we propose a novel feature extraction algorithm to extract basic and context features from the source code of web applications. Our approach uses these features to build various machine-learning models for predicting context-sensitive XSS security vulnerabilities. Experimental results show that prediction models based on the proposed features can discriminate vulnerable code from non-vulnerable code at a very low false rate.

Keywords: web application security, cross-site scripting vulnerability, machine learning, context-sensitive, input validation

978-1-4799-1966-6/15/$31.00 ©2015 IEEE

I. INTRODUCTION

Nowadays, a large number of people depend on web applications for social communication, health services, financial transactions, and other purposes. However, the presence of security vulnerabilities limits the use of these applications, as a malicious user can steal sensitive information (e.g. cookies, session data), send illegal HTTP requests, redirect benign users to malicious websites, install malware, and perform various other malicious operations. Recent security statistics reports reveal that approximately 55% of assessed web applications have a security vulnerability [1]. In 2013, the Open Web Application Security Project (OWASP) and Common Vulnerabilities and Exposures (CVE) reported Cross-Site Scripting (XSS) as one of the most serious vulnerabilities in web applications. The main cause of XSS vulnerabilities is a weakness in the source code that permits the use of user input in a web server output statement without any validation.

Researchers have proposed various static and dynamic analysis based approaches [2] to detect XSS vulnerabilities in the source code of web applications. Static analysis based detection techniques use a set of predefined rules to detect vulnerabilities in source code without executing it. These techniques are easy to implement, but produce too many false positives. Dynamic analysis based techniques use complex analysis to produce more accurate results. However, they require a large number of test cases to ensure that there are no false negatives. Alternatively, researchers [3] [4] have shown that software metrics and static code attributes are important features for building machine-learning models that predict security vulnerabilities. XSS vulnerabilities have different characteristics than general vulnerabilities. The existing XSS prediction approaches [5] [6] [7] do not consider the context of user input, which is very important for predicting XSS vulnerabilities.

In this paper, we propose a novel approach to extract basic and context features from source code to build a machine-learning-based vulnerability prediction model. To the best of our knowledge, ours is the first approach to use context information for predicting XSS vulnerabilities. The proposed approach has been implemented in a prototype tool that automatically extracts these features from PHP source code.

The rest of the paper is organized as follows. Section 2 presents the background and motivation for this work. Section 3 discusses prior work related to cross-site scripting vulnerability prediction. Section 4 describes a novel feature extraction approach for building prediction models. Section 5 provides the details of the dataset, experimental setting, and performance measures used to evaluate the proposed approach. Section 6 discusses the experimental results and compares the proposed approach with existing approaches. Finally, Section 7 concludes the paper and mentions future research directions.

II. BACKGROUND AND MOTIVATION

In this section, we describe cross-site scripting vulnerabilities and discuss the limitations of existing related approaches, which motivate this work.
A. Cross-Site Scripting Vulnerabilities
Cross-site scripting (XSS) is an application-level code-injection security vulnerability. It occurs whenever a server program (i.e. a dynamic web page) uses unrestricted input obtained via an HTTP request, a database, or files in its response without any validation. It allows a malicious user to steal sensitive information (e.g. cookies, session data) and perform other malevolent operations. Figure 1 illustrates the sequence of steps in a stored XSS attack. Initially, the malicious user uses the comment form of a blog site to insert and store malicious scripts in the site database. Then, a legitimate user sends an HTTP request to the site to view the latest comments. The site returns the stored comments along with the scripts in its response. Finally, the legitimate user's browser executes the scripts and sends the legitimate user's sensitive information to the attacker's server.

Fig. 1. Sequence diagram representing an XSS attack scenario

Cross-site scripting vulnerabilities are of three types: stored, reflected, and document object model (DOM)-based. A stored XSS vulnerability occurs whenever user input is stored in the database and later used in a response page. It generally occurs in forums, blogs, and social networking sites. A reflected XSS vulnerability occurs whenever user input is referenced in an immediate response web page without proper validation. It commonly occurs in error or greeting messages. These two types of vulnerabilities are caused by improper validation of user input by the server-side program. A DOM-based XSS vulnerability occurs whenever a client-side program uses unvalidated user input obtained dynamically from the DOM structure.
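The reflected case described above can be illustrated with a minimal sketch. It is written in Python rather than PHP purely for illustration; the function name `greeting_page` and the payload are hypothetical and do not come from the paper. The sketch only shows that, in the plain HTML body context, entity escaping is enough to neutralise a script payload.

```python
import html


def greeting_page(name: str, escape: bool) -> str:
    """Build a greeting page that reflects `name` in the HTML body context."""
    shown = html.escape(name) if escape else name
    return f"<html><body><p>Hello, {shown}!</p></body></html>"


# Hypothetical attack payload supplied as the user's name.
payload = "<script>alert(1)</script>"

unsafe = greeting_page(payload, escape=False)  # payload reaches the browser as markup
safe = greeting_page(payload, escape=True)     # payload is rendered as inert text

print(unsafe)
print(safe)
```

Escaping works here only because the input lands between tags; other contexts need different treatment.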
B. Limitations of Vulnerability Detection Approaches
Researchers have proposed many approaches to detect XSS vulnerabilities in the source code of web applications developed in PHP. Most existing static-analysis-based vulnerability detection approaches rely on predefined static rules. They consider source code to be vulnerability-free (i.e. safe) if user input is validated through standard built-in PHP sanitization functions (e.g. htmlspecialchars, htmlentities). In dynamic web applications, an output statement combines user input with constant HTML strings to generate a dynamic response. This combination determines an HTML context, in which a standard built-in function is not always sufficient to avoid vulnerabilities. Figure 2 is a snippet of vulnerable source statements in which user input is referenced in output statements with different contexts (e.g. JavaScript, HTML attribute name, attribute value, comment, URL). In statement 4, user input is assigned to a user-defined PHP variable $user_input. Then, in statement 12, the output statement uses the user input in the style attribute to change the text color dynamically. This statement cannot be secured with the default configuration of the standard sanitization function. For example, the user input (i.e. attack vector) green onmouseover=alert(/Meow!/) exploits the XSS vulnerability. Even the ENT_QUOTES quote-style option also fails to prevent the XSS attack in this scenario. Similarly, user input referenced in the comment block, JavaScript block, and body_anchor_NQ_Attr_Val context in statements 13, 11, and 10 respectively requires special context-sensitive filters to avoid XSS vulnerabilities. In [8], the authors used a pattern-matching technique to identify the HTML context and the ESAPI escaping library to mitigate XSS vulnerabilities in Java-based web applications. Saxena et al. (2011) [9] pointed out that context-mismatched sanitization and inconsistent multiple sanitization require essential modifications to XSS detection approaches. Alternatively, in [7], the authors showed that text-mining-based machine-learning models provide probability estimates for vulnerable source code segments. This saves the developer time and effort by concentrating attention on the code segments predicted to be vulnerable.

1.  <html>
2.  <body>
3.  <?php
    // accept and store user input in a variable
4.  $user_input = $_GET['UserData'];
    // use of user input in HTML tag attribute name
5.  echo "<div " . $user_input . "= bob />";
    // use of user input in HTML tag name
6.  echo "<" . $user_input . " href= \"/bob\" />";
    // use of user input in HTML tag body section
7.  echo $user_input;
    // use of user input in double quoted attribute value
8.  echo "<div id=\"" . $user_input . "\">text</div>";
    // use of user input in no quoted attribute value
9.  echo "<div id=" . $user_input . ">content</div>";
10. echo "<a href ='" . $user_input . "'>click</a>";
    // use of user input in single quoted event handler
11. echo "<div onmouseover=\"x='" . $user_input . "\"\>";
    // use of user input to change text color dynamically
12. echo "<span style=color:" . htmlspecialchars($user_input) . ">welcome</span>";
    // presence of user input in HTML comment block
13. <!-- <?php echo $user_input; ?> -->
    // use of user input in script block
14. <script>
    <?php $user_input = $_POST['UserData'];
    $user_input = intval($tainted); echo $user_input; ?>
    </script>
15. ?>
16. </body> </html>

Fig. 2. PHP code in which user input is referenced in different HTML contexts
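The limitation of context-insensitive sanitization in an unquoted attribute value can be checked with a small sketch. It uses Python's `html.escape` with `quote=True` as a rough stand-in for PHP's `htmlspecialchars` with `ENT_QUOTES`; that substitution is an assumption made for illustration, and the two functions are not guaranteed to behave identically in every case. The attack vector is the one quoted in the text.

```python
import html

# Attack vector taken from the discussion of the style-attribute statement.
attack = "green onmouseover=alert(/Meow!/)"

# Entity escaping only rewrites &, <, > and (with quote=True) both quote
# characters. The attack vector contains none of them, so it passes unchanged.
sanitized = html.escape(attack, quote=True)

# The value is placed in an unquoted style attribute, so the space in the
# input ends the attribute value and starts a new onmouseover attribute.
page = "<span style=color:" + sanitized + ">welcome</span>"

print(sanitized == attack)
print(page)
```

A context-aware filter for this position would have to reject or encode spaces and `=` as well, which is why the output context has to be known before a sanitizer can be judged adequate.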
III. RELATED WORK

This section describes prior vulnerability prediction approaches used for classifying vulnerable files, classes, or statements from benign ones. Table I compares the existing approaches on various parameters, i.e. the feature set used for prediction, the datasets used for training and testing, the machine-learning classifiers used for building prediction models, and the performance evaluation measures.

TABLE I. COMPARISON OF RELATED VULNERABILITY PREDICTION APPROACHES

| Authors | Features | Applications | Source code language / Identified vulnerabilities | Machine-learning algorithms | Performance |
| Shin et al. (2011) [3] | Code complexity, code churn, and developer activity metrics | Mozilla Firefox web browser, Red Hat Enterprise Linux kernel | C++ / General vulnerabilities | Logistic regression, J48, Random forest, NB, Bayesian network | Recall: 80% |
| Chowdhury et al. (2011) [4] | Code complexity, coupling, and cohesion metrics | Mozilla Firefox web browser | C++ / General vulnerabilities | Logistic regression, C4.5, Random forest, NB | Precision: 4%, Recall: 74%, Accuracy: 73%, F1 measure: 73% |
| Shar et al. (2013) [6] | Static code attributes | PHP web applications | PHP / XSS, SQL | C4.5, NB, MLP | Recall: >78%, Pf: <6% |
| Shar et al. (2013) [10] | Static and dynamic code attributes | PHP web applications | PHP / XSS, SQL | Logistic regression, MLP | Recall: 86%, Pf: 3% |
| Hovsepyan et al. (2012) [11] | Unique words | K9 mail client application | Java / Any vulnerabilities | SVM | Accuracy: 87%, Precision: 85%, Recall: 88% |
| Scandariato et al. (2014) [7] | Unique words and Uni_tokens | Java application and Drupal CMS | Java and PHP / General and XSS vulnerabilities | Decision Trees, k-Nearest Neighbour, NB, Random Forest, SVM | Recall: 82% |
| Walden et al. (2014) [5] | PHP tokens and software metrics (i.e. LOC, cyclomatic complexity) | PHPMyAdmin, Moodle, and Drupal CMS | PHP / Code Injection, CSRF, XSS, Path Disclosure | Random Forest | Recall: 80.5%, Accuracy: 75.4% |

Chowdhury et al. (2011) [4] used complexity, cohesion, and coupling metrics to predict vulnerability-prone files in Mozilla Firefox. Similarly, Shin et al. [3] utilized code complexity, code churn, and developer activity metrics to discriminate files with general vulnerabilities from benign ones. They found that complex code is more prone to vulnerabilities and predicted 80% of the known vulnerable files with less than 25% false positives.

In contrast, Shar et al. (2012) [12] claimed that simple and small programs have many XSS vulnerabilities, which agrees with our observations. Therefore, general vulnerability prediction models based on those metrics are not effective for predicting XSS vulnerabilities. The authors of [10] considered the use of unvalidated user input to be the main source of XSS vulnerabilities. They extracted input, output, validation, and sanitization code constructs through static and dynamic analysis. They then classified these code constructs into various categories and used them as features to build machine-learning models for identifying vulnerable statements. Medeiros et al. (2014) [13] also considered code constructs that manipulate strings in order to minimize false positives in their vulnerability detection approach.

Recently, Scandariato et al. (2014) [7] proposed the first text-mining-based machine-learning models for predicting vulnerable files in the source code of software applications. They treated source code as text and characterized each source code file as a term frequency vector. In [5], Walden et al. (2014) compared software metrics with text-mining features (i.e. unigrams) and observed that text-mining features provide significantly better performance in predicting XSS vulnerabilities. Our approach is an extension of their work.
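The term frequency representation used by the text-mining approaches can be sketched as follows. This is a minimal illustration in Python, not the implementation from [5] or [7]; the token lists and the helper name `term_frequency_vector` are hypothetical.

```python
from collections import Counter


def term_frequency_vector(tokens, vocabulary):
    """Count how often each vocabulary term occurs in one file's token list."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


# Hypothetical token streams for two source files.
file_a = ["T_VARIABLE", "=", "T_VARIABLE", "T_ECHO", "T_VARIABLE"]
file_b = ["T_ECHO", "T_STRING", "T_VARIABLE"]

# The vocabulary is the sorted union of all terms seen in the corpus.
vocabulary = sorted(set(file_a) | set(file_b))

vec_a = term_frequency_vector(file_a, vocabulary)
vec_b = term_frequency_vector(file_b, vocabulary)

print(vocabulary)
print(vec_a, vec_b)
```

Each file becomes one row of a fixed-width numeric matrix, which is the form a classifier expects.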
Fig. 3. Flow graph of the proposed vulnerability prediction approach

IV. PROPOSED VULNERABILITY PREDICTION APPROACH

The proposed approach proceeds as follows. First, we extract user-input context features from the output statements. Then, we extract basic features that represent the characteristics of input, output, and validation and sanitization routines through the PHP tokenizing process. Next, we construct a feature set by combining the basic and context features. Finally, we use various machine-learning algorithms to build prediction models. Figure 3 depicts the steps followed in the proposed approach to classify vulnerable source code files from benign ones.
A. Main Feature Extraction Algorithms
This section presents a systematic procedure, Algorithm 1, to extract relevant features from PHP source code files. It consists of two major steps: extraction of basic code features and identification of the user-input context in the output statement. In the first step, we extract each HTML-Block that contains PHP code through the HTML DOM parser. The type of HTML-Block (i.e. script, style, comment etc.) is considered the Block-Context for the PHP code present in it. Then, we tokenize the contents of each extracted HTML-Block (i.e. the PHP code) through the Zend engine's lexical scanner. In this process, some tokens are tagged with the corresponding Block-Context (as described in Algorithm 1), and these tagged tokens are considered basic code features in our feature set. In the second step, we process the remaining tokens, whose token value contains either an incomplete HTML tag in a constant string or PHP code in an HTML tag. These token values represent the user-input context in an output statement. We extract the user-input context through the proposed Context-Finder algorithm (Algorithm 2). Further, the Block-Context tag is attached to the user-input context, and the tagged context is added to our feature set. During feature extraction, we pre-process the source code to remove pure HTML statements and HTML comment statements that do not contain any PHP code constructs, because these statements contribute no meaningful information to the vulnerability prediction model.

Algorithm 1 Feature Vector Preparation
Input: Source code files P = {P1, P2, P3, ...}
Output: Feature vector list F
B = array of HTML-Blocks in a source code file
TypeB = variable that represents the block context
TypeS = variable that represents the user-input context
G = list of PHP global variables
|P| = number of source code programs
DToken = list of ignorable tokens
for each source program Pi in P do
    for each extracted HTML-Block Bj in B do
        // Block-Context may be style, script, comment etc.
        Set TypeB = Block-Context of Bj extracted by the HTML DOM parser
        for each statement Sk in Bj do
            // convert each statement into a set of tokens
            T = TokenGetAll(Sk)
            for each token ti in T do
                // start of token processing loop
                if ti_name in DToken then
                    no-op
                else if ti_name == T_VARIABLE then
                    if ti_value in G then
                        Add ti_value to feature set F
                    else
                        tagged_T = attach block tag to ti_name
                        Add tagged_T to feature set F
                    end if
                else if ti_name == T_ECHO then
                    tagged_T = attach block tag to ti_name
                    Add tagged_T to feature set F
                // Let T1 = T_CONSTANT_ENCAPSED_STRING
                // Let T2 = T_ENCAPSED_AND_WHITESPACE
                else if ti_name == T1 or ti_name == T2 then
                    if ti_value contains some HTML code then
                        TypeS = Context_Finder(ti_value, TypeB)
                        Add TypeS to feature set F
                    else
                        Add ti_value to feature set F
                    end if
                else
                    tagged_T = attach block tag to ti_value
                    Add tagged_T to feature set F
                end if
            end for
        end for
    end for
end for
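The token-processing loop of Algorithm 1 can be sketched in Python under a few stated assumptions: tokens are supplied as pre-computed `(name, value)` pairs instead of coming from the Zend lexical scanner, "contains some HTML code" is approximated by the presence of an angle bracket, and the tag is joined with an underscore. The function name `prepare_features`, the stub context finder, and the sample tokens are all hypothetical.

```python
# Token names treated as string literals (T1 and T2 in Algorithm 1).
STRING_TOKENS = {"T_CONSTANT_ENCAPSED_STRING", "T_ENCAPSED_AND_WHITESPACE"}


def prepare_features(blocks, context_finder, php_globals, ignorable):
    """Build a feature list from (block_context, tokens) pairs.

    Each token is a (name, value) pair, assumed to come from a PHP tokenizer.
    """
    features = []
    for block_context, tokens in blocks:
        for name, value in tokens:
            if name in ignorable:
                continue  # ignorable tokens contribute no feature
            if name == "T_VARIABLE":
                if value in php_globals:
                    features.append(value)  # keep global names such as $_GET
                else:
                    features.append(f"{name}_{block_context}")
            elif name == "T_ECHO":
                features.append(f"{name}_{block_context}")
            elif name in STRING_TOKENS:
                # Assumption: an angle bracket signals embedded HTML.
                if "<" in value or ">" in value:
                    features.append(context_finder(value, block_context))
                else:
                    features.append(value)
            else:
                features.append(f"{value}_{block_context}")
    return features


# Stub standing in for the Context-Finder step; purely illustrative.
def stub_finder(value, block_context):
    return f"{block_context}_CONTEXT"


script_block = ("script_block", [
    ("T_VARIABLE", "$data"), ("=", "="), ("T_STRING", "intval"),
    ("(", "("), ("T_VARIABLE", "$data"), (")", ")"), (";", ";"),
    ("T_ECHO", "echo"), ("T_VARIABLE", "$data"), (";", ";"),
])
element_block = ("HTML_Element", [
    ("T_VARIABLE", "$data"), ("=", "="), ("T_VARIABLE", "$_GET"), (";", ";"),
    ("T_ECHO", "echo"), ("T_CONSTANT_ENCAPSED_STRING", '"<div id=\\""'),
])

features = prepare_features(
    [script_block, element_block],
    context_finder=stub_finder,
    php_globals={"$_GET", "$_POST", "$_COOKIE"},
    ignorable={"=", ";", "(", ")"},
)
print(features)
```

User-defined variable names collapse into one tagged feature per block, while sanitization function names such as `intval` stay distinguishable.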


Algorithm 2 Context-Finder: Find user-input context in output statement
Input: A string S that contains HTML code, and the block context TypeB
Output: Context of user input in the output statement
TypeS = variable representing the user-input context
if a complete HTML tag is present in string S then
    return TypeB
else if string S starts with < and ends with =" or =' or = then
    // BLOCK 1
    if a special tag (i.e. a | style | script) appears just after < in string S then
        if any event handler (e.g. onload) is present in string S then
            TypeS = TypeB + Event_Attr_Value + (DQ | SQ | NQ)
            return TypeS
        else
            TypeS = TypeB + STag_Attr_Value + (DQ | SQ | NQ)
            return TypeS
        end if
    else
        // the special tag is not present just after the opening tag
        if any event handler (e.g. onload) is present in string S then
            TypeS = TypeB + Tag_Event_Attr_Val + (DQ | SQ | NQ)
            return TypeS
        else if any style is present in string S then
            TypeS = TypeB + Tag_CSS_Attr_Val + (DQ | SQ | NQ)
            return TypeS
        end if
    end if
else if string S == < Non_special_tag then
    // BLOCK 2
    TypeS = TypeB + Attr_Name
    return TypeS
else if string S == < then
    // BLOCK 3
    TypeS = TypeB + Tag_Name
    return TypeS
else
    return TypeB
end if
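A possible reading of the Context-Finder steps is sketched below in Python. Several details are assumptions rather than the authors' implementation: the regular expressions used to recognise tags and event handlers, the underscore used to join label parts, and the fallback to the block context when no branch of BLOCK 1 applies (the pseudocode leaves that case open). The function name `context_finder` and the label spellings follow Algorithm 2.

```python
import re

SPECIAL_TAGS = ("a", "style", "script")
EVENT_HANDLER = re.compile(r"\bon[a-z]+\s*=", re.IGNORECASE)  # e.g. onload=
STYLE_ATTR = re.compile(r"\bstyle\s*=", re.IGNORECASE)
COMPLETE_TAG = re.compile(r"<[^<>]+>")
TAG_NAME = re.compile(r"<\s*([a-zA-Z][a-zA-Z0-9]*)")


def quote_style(s):
    """Return DQ, SQ or NQ when s ends by opening an attribute value."""
    if s.endswith('="'):
        return "DQ"
    if s.endswith("='"):
        return "SQ"
    if s.endswith("="):
        return "NQ"
    return None


def context_finder(s, type_b):
    """Classify where user input would land after the HTML fragment s."""
    s = s.strip()
    if COMPLETE_TAG.search(s):
        return type_b
    quote = quote_style(s)
    tag = TAG_NAME.match(s)
    if s.startswith("<") and quote is not None:  # BLOCK 1
        special = tag is not None and tag.group(1).lower() in SPECIAL_TAGS
        if EVENT_HANDLER.search(s):
            kind = "Event_Attr_Value" if special else "Tag_Event_Attr_Val"
            return f"{type_b}_{kind}_{quote}"
        if special:
            return f"{type_b}_STag_Attr_Value_{quote}"
        if STYLE_ATTR.search(s):
            return f"{type_b}_Tag_CSS_Attr_Val_{quote}"
        return type_b  # assumption: case not covered by the pseudocode
    if s == "<":  # BLOCK 3
        return f"{type_b}_Tag_Name"
    if tag is not None and tag.end() == len(s) \
            and tag.group(1).lower() not in SPECIAL_TAGS:  # BLOCK 2
        return f"{type_b}_Attr_Name"
    return type_b


for fragment in ["<a href='", "<div onmouseover=\"x='", '<span style="',
                 "<div ", "<", "<div>content</div>"]:
    print(repr(fragment), "->", context_finder(fragment, "HTML_Element"))
```

The labels produced here follow the naming in Algorithm 2 and may not match the exact feature strings of the authors' prototype tool.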

B. Example
The features extracted by the proposed feature extraction approach can be explained as follows. Consider the source code statements given in Figure 4. In this code, statements 3-6, 9-11, and 16-18 are present in an HTML element, a comment block, and a script block respectively. The proposed approach extracts the HTML_Element, Comment, and Script block contexts and then tokenizes the code statements of each block to build the feature set. The extracted features corresponding to the given source code are presented in Table II. The text-mining-based prediction approach [7] tokenizes the source code and considers PHP tokens as features. In that approach, user-defined variable names are treated as distinct features, which are not useful from a vulnerability point of view. Also, a single T_STRING feature is used for all strings (e.g. ENT_QUOTES, htmlspecialchars) in their feature set. However, ENT_QUOTES is a parameter and htmlspecialchars is a sanitization function in PHP, yet both fall into the same category. In addition, their feature vector does not discriminate between user input referenced in different HTML blocks.

1. <html> <body>
2. <?php
3. $data=$_GET['Data'];
4. echo $data ;
5. echo "<div id=\"". $data ."\">content</div>" ;
6. echo "<div onmouseover=\"x='". $data."\"\>";?>
7. <!--
8. <?php
9. $data = $_GET['Data'];
10. $data = intval($data);
11. echo $data ;
12. ?>
13. -->
14. <script>
15. <?php
16. $data = $_GET['Data'];
17. $data = intval($data);
18. echo $data ;
19. ?>
20. </script>
21. </body> </html>

Fig. 4. XSS-vulnerable source code statements that require context-sensitive sanitization

TABLE II. COMPARISON OF PROPOSED EXTRACTED FEATURES

| Code Line | Proposed Approach Features (F2) | Walden Features (F1) |
| 1 | | T_INLINE_HTML |
| 2 | | T_OPEN_TAG |
| 3 | T_VARIABLE_HTML_Element, $_GET_HTML_Element | T_VARIABLE($data), =, T_VARIABLE($_GET), [, ], ; |
| 4 | T_ECHO_HTML_Element, T_VARIABLE_HTML_Element | T_ECHO, T_VARIABLE($data), ; |
| 5 | T_ECHO_HTML_Element, HTML_Element_H_TAG_DQ_Attr_Val, T_VARIABLE_HTML_Element | T_ECHO, ., T_VARIABLE($data), ., ; |
| 6 | T_ECHO_HTML_Element, HTML_Element_HTAG_Event_DQ_Attr_Val, T_VARIABLE_HTML_Element | T_ECHO, ., T_VARIABLE($data), ., ; |
| 7 | | T_INLINE_HTML |
| 8 | | T_OPEN_TAG |
| 9 | T_VARIABLE_comment_block, $_GET_comment_block | T_VARIABLE($data), =, T_VARIABLE($_GET), [, ], ; |
| 10 | T_VARIABLE_comment_block, $_GET_comment_block, T_VARIABLE_comment_block | T_VARIABLE($data), =, T_STRING, (, T_VARIABLE($data), ), ; |
| 11 | T_ECHO_comment_block, T_VARIABLE_comment_block | T_ECHO, T_VARIABLE($data) |
| 12 | | T_CLOSE_TAG |
| 13 | | T_INLINE_HTML |
| 14 | | |
| 15 | | T_OPEN_TAG |
| 16 | T_VARIABLE_script_block, $_GET_script_block | T_VARIABLE($data), =, T_VARIABLE($_GET), [, ], ; |
| 17 | T_VARIABLE_script_block, intval_script_block, T_VARIABLE_script_block | T_VARIABLE($data), =, T_STRING, (, T_VARIABLE($data), ), ; |
| 18 | T_VARIABLE_script_block, T_VARIABLE_script_block | T_ECHO, T_VARIABLE($data), ; |
| 19 | | T_CLOSE_TAG |
| 20 | | T_INLINE_HTML |

V. DATASET AND EXPERIMENTAL SETTINGS

We have used a publicly available Git repository [14] that contains a synthetic test case generator. The available dataset contains 9408 samples written in PHP: 5600 safe and 3808 unsafe samples, organized into different categories. The proposed methods are evaluated on this dataset because it covers almost all the cases required for XSS vulnerability prediction. The dataset is freely available and open source, and it contains PHP source code together with vulnerability labels. It also contains object-oriented source code cases, which help to evaluate the performance of the proposed methods on object-oriented code. This dataset is better suited to evaluating the proposed approach than other existing datasets. Other existing vulnerability repositories such as Bugzilla, CVE, and NVD contain only vulnerability information and do not provide the source code needed to extract vulnerability prediction features. Therefore, they are inadequate for evaluating our proposed approach. The NIST (National Institute of Standards and Technology) dataset is also publicly available for evaluating vulnerability prediction methods, but it contains only a limited number of PHP source code samples (i.e. 80), which is insufficient to build an effective machine-learning model.

A. Experimental Setting
The 10-fold cross-validation technique is used to evaluate the performance of the proposed approach. We randomly divide the dataset into 90% training and 10% testing programs, such that the two sets are disjoint. We repeat all the experiments 10 times with randomly selected training and testing sets, and the final performance is reported as the average of the results.

B. Prediction Performance Measures
In this paper, we have extracted features from the chosen dataset and built different machine-learning models (i.e. predictors) with the WEKA machine-learning tool [15]. It is an open-source, platform-independent, and freely available tool that includes implementations of different machine-learning algorithms for data mining and machine-learning experiments [16]. Similar to related prediction approaches, we have used various performance measures, i.e. recall, precision, F-measure, and accuracy, to evaluate the predictive performance with respect to the vulnerable class samples.

VI. RESULTS AND DISCUSSION

The results for the two feature sets, namely Walden Features (F1) and Proposed Features (F2), with respect to evaluation measures such as precision, recall, F-measure, and accuracy are shown in Table III.

The proposed feature set performs significantly better than the existing Walden features F1. For example, the proposed feature set produces an F-measure of 90.6% and an accuracy of 92.6% with the Bagging classifier, compared with an F-measure of 61.6% and an accuracy of 71.3% for the same classifier with the Walden features, as shown in Table III. Likewise, with the other machine-learning algorithms, the proposed feature set significantly improves the performance of the vulnerability prediction model. The proposed features outperform the other features because they take into account the tagged tokens and the user-input context in output statements, which is important information for vulnerability prediction.
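The evaluation measures used in this comparison follow the standard confusion-matrix definitions, sketched below in Python. The confusion-matrix counts in the example are hypothetical and are not taken from the experiments; only the final check uses the Bagging precision and recall values reported for the two feature sets.

```python
def precision(tp, fp):
    """Fraction of samples predicted vulnerable that really are vulnerable."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of vulnerable samples that the model finds."""
    return tp / (tp + fn)


def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


def accuracy(tp, tn, fp, fn):
    """Fraction of all samples that are classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical confusion matrix for the vulnerable class.
tp, fn, fp, tn = 88, 12, 6, 94
print(precision(tp, fp), recall(tp, fn), accuracy(tp, tn, fp, fn))

# Consistency check against reported Bagging precision and recall (in percent).
proposed_f = round(f_measure(93.4, 88.0), 1)
walden_f = round(f_measure(67.1, 56.9), 1)
print(proposed_f, walden_f)
```

Recomputing the F-measure from the reported precision and recall reproduces the quoted 90.6% and 61.6%, which is a quick way to sanity-check a results table.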

TABLE III. EVALUATION OF PREDICTORS ON DATA SET FOR XSS VULNERABILITY PREDICTION

| Approach | Evaluation Measure | SVM | NB | Bagging | Random Forest | J48 | JRip |
| F1 (Walden Features) | Precision | 66.2 | 59.6 | 67.1 | 64.7 | 67.8 | 65.6 |
| | Recall | 57.6 | 39.4 | 56.9 | 53.7 | 56.8 | 53 |
| | F1-measure | 61.6 | 47.5 | 61.6 | 58.7 | 61.8 | 58.6 |
| | Accuracy | 70.9 | 64.7 | 71.3 | 69.4 | 71.6 | 69.7 |
| F2 (Proposed Approach) | Precision | 88.7 | 67.8 | 93.4 | 86.9 | 92.9 | 99.2 |
| | Recall | 83.2 | 47 | 88 | 81.6 | 86.9 | 60 |
| | F1-measure | 85.9 | 55.5 | 90.6 | 84.2 | 89.8 | 74.8 |
| | Accuracy | 88.9 | 69.5 | 92.6 | 87.6 | 92 | 83.6 |

Models are trained using various classifiers: SVM, NB, Bagging, Random Forest, J48, and JRip. Experimental results indicate that, with the proposed features, the Bagging classifier performs better than the other classifiers. However, its performance is very close to that of the J48 classifier, as shown in Table III. J48 performs better than the remaining classifiers because it reduces the effect of attributes with low information gain, i.e. irrelevant attributes. J48 produces an F-measure of 89.8% with the proposed features, compared with 85.9%, 55.5%, 84.2%, and 74.8% for the SVM, NB, Random Forest, and JRip classifiers respectively (Table III).

VII. CONCLUSIONS AND FUTURE WORK

Vulnerability prediction is an important task in securing web applications before their release. Insecure web applications may lead to the theft of personal and critical user information. This paper proposed a novel approach to extract relevant features for classifying vulnerable source code files from benign ones. Experimental results showed that considering the context of the user input significantly improves the performance of the vulnerability prediction model. We used various machine-learning algorithms to develop the vulnerability prediction model, viz. SVM, NB, Bagging, Random Forest, J48, and JRip classifiers. All the experiments were performed on a publicly available dataset for vulnerability prediction. Experimental results showed that the Bagging classifier performed best among all the classifiers. In the future, we wish to analyze other vulnerabilities present in web application source code and work on the prediction of statement-level vulnerabilities.

ACKNOWLEDGEMENT
We thank James Walden, Associate Professor and Director of the Center for Information Security, Northern Kentucky University, for providing valuable insights and helpful suggestions on our paper. We also thank the management of Swami Keshvanand Institute of Technology, Management & Gramothan, Jaipur, Rajasthan, India for their support and encouragement.

REFERENCES
[1] WhiteHat Security. Web statistics report. https://whitehatsec.com/categories/statistics-report, 2013. Accessed: 2013-06-26.
[2] Isatou Hydara, Abu Bakar Md. Sultan, Hazura Zulzalil, and Novia Admodisastro. Current state of research on cross-site scripting: a systematic literature review. Information and Software Technology, 58:170-186, 2015.
[3] Yonghee Shin, A. Meneely, L. Williams, and J.A. Osborne. Evaluating complexity, code churn, and developer activity metrics as indicators of software vulnerabilities. IEEE Transactions on Software Engineering, 37(6):772-787, Nov 2011.
[4] Istehad Chowdhury and Mohammad Zulkernine. Using complexity, coupling, and cohesion metrics as early indicators of vulnerabilities. Journal of Systems Architecture, 57(3):294-313, 2011. Special Issue on Security and Dependability Assurance of Software Architectures.
[5] J. Walden, J. Stuckman, and R. Scandariato. Predicting vulnerable components: Software metrics vs text mining. IEEE 25th International Symposium on Software Reliability Engineering (ISSRE), pages 23-33, Nov 2014.
[6] Lwin Khin Shar and Hee Beng Kuan Tan. Predicting SQL injection and cross site scripting vulnerabilities through mining input sanitization patterns. Information and Software Technology, 55(10):1767-1780, 2013.
[7] R. Scandariato, J. Walden, A. Hovsepyan, and W. Joosen. Predicting vulnerable software components via text mining. IEEE Transactions on Software Engineering, 40(10):993-1006, Oct 2014.
[8] Lwin Khin Shar and Hee Beng Kuan Tan. Automated removal of cross site scripting vulnerabilities in web applications. Information and Software Technology, 54:467-478, 2012.
[9] Prateek Saxena, David Molnar, and Benjamin Livshits. Scriptgard: Automatic context-sensitive sanitization for large-scale legacy web applications. Proceedings of the 18th ACM Conference on Computer and Communications Security, pages 601-614, 2011.
[10] Lwin Khin Shar, Hee Beng Kuan Tan, and Lionel C. Briand. Mining SQL injection and cross site scripting vulnerabilities using hybrid program analysis. Proceedings of the 2013 International Conference on Software Engineering, pages 642-651, 2013.
[11] Aram Hovsepyan, Riccardo Scandariato, Wouter Joosen, and James Walden. Software vulnerability prediction using text analysis techniques. Proceedings of the 4th International Workshop on Security Measurements and Metrics, pages 7-10, 2012.
[12] Lwin Khin Shar and Hee Beng Kuan Tan. Predicting common web application vulnerabilities from input validation and sanitization code patterns. Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering, pages 310-313, 2012.
[13] Ibéria Medeiros, Nuno F. Neves, and Miguel Correia. Automatic detection and correction of web application vulnerabilities using data mining to predict false positives. Proceedings of the 23rd International Conference on World Wide Web, pages 63-74, 2014.
[14] Bertrand Stivalet and Aurelien Delaitre. PHP vulnerabilities test suite. https://github.com/stivalet/PHP-Vulnerability-test-suite, 2014. Accessed: 2014-07-13.
[15] Peter Reutemann, Eibe Frank, Mark Hall, and Len Trigg. Weka: Data mining tool. http://www.cs.waikato.ac.nz/ml/weka, 2013. Accessed: 2013-06-26.
[16] Ian H. Witten, Eibe Frank, and Mark A. Hall. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 3rd edition, 2011.
