

A Large-Scale Hidden Semi-Markov Model for Anomaly Detection on User Browsing Behaviors

Yi Xie and Shun-Zheng Yu, Member, IEEE

Abstract—Many methods designed to create defenses against distributed denial of service (DDoS) attacks are focused on the IP and TCP layers rather than the high layer. They are not suitable for handling the new type of attack which is based on the application layer. In this paper, we introduce a new scheme to achieve early attack detection and filtering for the application-layer-based DDoS attack. An extended hidden semi-Markov model is proposed to describe the browsing behaviors of web surfers. In order to reduce the computational amount introduced by the model's large state space, a novel forward algorithm is derived for the online implementation of the model based on the M-algorithm. The entropy of the user's HTTP request sequence fitted to the model is used as a criterion to measure the user's normality. Finally, experiments are conducted to validate our model and algorithm.

Index Terms—Anomaly detection, browsing behaviors, DDoS, hidden semi-Markov model, M-algorithm.

Manuscript received September 24, 2006; revised July 03, 2007 and October 23, 2007; approved by IEEE/ACM TRANSACTIONS ON NETWORKING Editor B. Levine. First published May 12, 2008; current version published February 19, 2009. This work was supported by the Key Program of NSFC–Guangdong Joint Funds (Grant U0735002), the National High Technology Research and Development Program of China (2007AA01Z449), and the Research Fund for the Doctoral Program of Higher Education under Grant 20040558043. The authors are with the Department of Electrical and Communication Engineering, Sun Yat-Sen University, Guangzhou 510275, China (e-mail: xieyicn@163.com; syu@mail.sysu.edu.cn). Digital Object Identifier 10.1109/TNET.2008.923716

I. INTRODUCTION

DISTRIBUTED denial of service (DDoS) attacks constitute one of the major threats and are among the hardest security problems facing today's Internet [1]. Because of the seriousness of this problem, many defense mechanisms based on statistical approaches [2]–[11] have been proposed to combat these attacks.

In statistical approaches to defending against DDoS attacks, the statistics of packet attributes in the headers of IP packets, such as IP address, time-to-live (TTL), protocol type, etc., are measured, and the packets deemed most likely to be attack packets are dropped based on these measurements. Such approaches often assume that there are some traffic characteristics that inherently distinguish the normal packets from the attack ones. Therefore, "abnormal" traffic can be detected based on those traffic characteristics during a DDoS attack.

Although the approaches based on statistical attributes of the TCP or IP layers are valuable for certain DDoS attacks (e.g., SYN/ACK flooding), they are not always workable for some special DDoS attacks which work on the application layer. This was witnessed on the Internet in 2004, when a worm virus named "Mydoom" [12] used pseudo HTTP requests to attack victim servers by simulating the behavior of browsers. Because the DDoS defense mechanisms that are based on the statistical attributes of lower-layer information could not distinguish the abnormal HTTP requests from the normal ones, the victim's server soon collapsed. In this paper, we call such application-layer DDoS attacks "App-DDoS" attacks.

The challenge of detecting App-DDoS attacks is due to the following aspects. (i) App-DDoS attacks utilize high-layer protocols to pass through most of the current anomaly detection systems designed for low layers and arrive at the victim webserver. (ii) App-DDoS attacks usually depend on successful TCP connections, which makes general defense schemes based on the detection of spoofed IP addresses useless. (iii) Flooding is not the only form of App-DDoS attack. There are many other forms, such as consuming the resources of the server (CPU, memory, hard disk, or database), arranging the malicious traffic to mimic the average request rate of normal users, or utilizing a large-scale botnet to produce low-rate attack flows. This makes detection more difficult.

In the literature, few studies can be found that focus on the detection of App-DDoS attacks. This paper develops a novel model to capture the browsing patterns of web users and to detect App-DDoS attacks. Our contributions are in three aspects: (i) based on the hidden semi-Markov model (HsMM) [13], [14], a new model is introduced for describing the browsing behavior of web users and detecting App-DDoS attacks; (ii) a new effective algorithm is proposed for the implementation of the forward process of the HsMM and the on-line detection of App-DDoS attacks; and (iii) experiments based on real traffic are conducted to validate our detection method.

This paper is organized as follows. In Section II, we describe the related work of our research. In Section III, we present the assumptions made in our scheme, and we establish a new model for our scheme in Section IV. Then, in Section V, we propose a novel detection algorithm based on the HsMM. The whole scheme is validated in Section VI by conducting an experiment using real traffic data. Finally, we discuss some important issues on the scheme in Section VII and conclude our work in Section VIII.

II. RELATED WORK


Most current DDoS-related studies focus on the IP layer or TCP layer instead of the application layer. Mirkovic et al. [2] devised a defense system called D-WARD, installed in edge routers, to monitor the asymmetry of two-way packet rates and to identify attacks. IP addresses [3] (under the assumption that attack traffic uses randomly spoofed addresses) and TTL values [4] were also used to detect DDoS attacks.

Statistical abnormalities of ICMP, UDP, and TCP packets were mapped to specific DDoS attacks based on the Management Information Base (MIB) [5]. Limwiwatkul [6] discovered DDoS attack signatures by analyzing the TCP packet header against well-defined rules and conditions. Noh [7] attempted to detect attacks by computing the ratio of TCP flags to TCP packets received at a webserver and considering the relative proportions of arriving packets devoted to TCP, UDP, and ICMP traffic. Basu [8] proposed techniques to extract features from connection attempts and classify the attempts as suspicious or not. The rate of suspicious attempts over a day helped to expose stealthy DoS attacks, which attempt to maintain effectiveness while avoiding detection.

Little existing work has been done on the detection of App-DDoS attacks. Ranjan et al. [9] used statistical methods to detect characteristics of HTTP sessions and employed rate-limiting as the primary defense mechanism. Besides the statistical methods, some researchers combated App-DDoS attacks with "puzzles". Kandula [10] designed a system to protect a web cluster from DDoS attacks by (i) designing a probabilistic authentication mechanism using CAPTCHAs (an acronym for "Completely Automated Public Turing test to tell Computers and Humans Apart") and (ii) designing a framework that optimally divides the time spent in authenticating new clients and serving authenticated clients.

However, statistical methods can hardly distinguish vicious HTTP requests from normal ones. Hence, such methods can merely be used for anomaly monitoring but not for filtering. Defenses based on puzzles, which require each client to return a solution before accessing the site, are not effective because (i) CAPTCHAs might annoy users and introduce additional service delays for legitimate clients, and (ii) puzzles also have the effect of denying web crawlers access to the site and disabling the search engines that index its content. Furthermore, new techniques may render the puzzles solvable by automated methods [16]. In [17], Jung et al. proposed methods to filter HTTP-level offending DoS traffic. They provided topological clustering heuristics for a web server to distinguish denial-of-service attacks from flash crowd behavior according to the source IP address. However, IP addresses are subject to spoofing, and using a white-list of source addresses of legitimate clients/peers is difficult, since many hosts may have dynamic IP addresses due to the use of NAT, DHCP, and mobile IP.

Web user behavior mining has been well studied. The existing works can be summarized as the following four types. The first is based on probabilistic models, e.g., [18], which use a Zipf-like distribution to model the page jump probability, a double Pareto distribution for the link choice, a log-normal distribution for revisiting, etc. The second is based on click-streams and web content, e.g., [19], which uses data mining to capture a web user's usage patterns from the click-stream dataset and page content. The third is based on the Markov model, e.g., [20], which uses Markov chains to model the URL access patterns observed in navigation logs based on the previous state. The fourth applies user behavior to implement anomaly detection, e.g., [21], which uses system-call data sets generated by programs to detect anomalous access to a UNIX system based on data mining.

The main disadvantage of the first type of methods is that they do not take into account the user's series of operations (e.g., which page will be requested in the next step). Therefore, they cannot describe the browsing behavior of a user, because the next page the user will browse is primarily determined by the current page he/she is browsing. The second type of methods needs intensive computation for page content processing and data mining, and hence is not very suitable for on-line detection. Moreover, the click-stream data sets required by these algorithms are collected by additional means (e.g., cookies or JAVA applets), which require the support of clients. The third type of methods omits the dwell time that the user spends reading a page, and it does not consider the cases in which a user does not follow the hyperlinks provided by the current page or in which some of the user's HTTP requests are responded to by proxies and never arrive at the server. Besides, when the total number of pages (i.e., the Markovian state space) of the website is very large, the algorithm would be too complex to implement in online detection. The last type of methods merely focuses on the system calls of a UNIX program instead of web user browsing behavior. Obviously, system calls can be recorded completely in the log, but the observations of web users' requests are incomplete because many HTTP requests are responded to by proxies or caches and cannot arrive at the server to be recorded in the log. Thus, a new method designed for the detection of App-DDoS attacks has to consider these issues.

III. MODEL PRELIMINARIES AND ASSUMPTIONS

The proposed scheme for App-DDoS attack detection and filtering is based on two results obtained by web mining [22]: (i) about 10% of the webpages of a website may draw 90% of the access; (ii) web user browsing behavior can be abstracted and profiled by users' request sequences. Thus, we can use a universal model to profile the short-term web browsing behavior, and we only need the logs of the webserver to build the model, without any additional support from outside of the webserver.

Another assumption of the proposed scheme is that it is difficult for an attacker or virus routine to launch an App-DDoS attack by completely mimicking normal web user access behavior. This assumption is based on the following considerations. Usually, browsing behavior can be described by three elements: HTTP request rate, page viewing time, and requested sequence (i.e., the requested objects and their order). In order to launch stealthy attacks, attackers can mimic normal access behavior by simulating the HTTP request rate, the viewing time of a webpage, or the HTTP request sequence. The first two simulations can be carried out by HTTP simulation tools (e.g., [23]). The HTTP attack traffic generated this way has statistical characteristics similar to normal traffic, but it cannot simulate the process of user browsing behavior, because it is a pre-designed routine and cannot capture the dynamics of the users and the networks. Besides, a low-rate, stealthy attack requires more infected computers, which needs the help of a large-scale botnet. Another option for the attack is simulating the HTTP request sequence. Four options can be used to carry out this intention: stochastic generation, presetting, interception with replaying, and remote control.

Using the first method, attackers (routines) generate the HTTP request objects randomly, which makes the attackers' requested contents and orders different from those of normal users. With the second method, an attack sponsor has to embed an HTTP request list into the routine before the worm virus is propagated. Thus, when the infected computers launch attacks through the preset HTTP request list, one visible characteristic is that many users browse/request the same web objects periodically and repeatedly. With the third method, the worm virus launches an attack by intercepting the request sequences sent by the browser of an infected computer and replaying them to the victim webserver. However, because the users of infected computers may not access the victim website, it is not easy for this method to be successful. The last way is that the sponsor of the attacks establishes a central control node and communicates with the attack routines residing in the infected computers periodically. By this means, the attacker can control the attack behavior of each zombie completely, e.g., distribute new HTTP attack sequences or vary the attack mode to conceal its baleful behaviors. However, though the attacker may be able to control its botnet without being detected by anti-virus software on the client side, we can suppose that the attack controller is unable to obtain historical access records (i.e., logs) from the victim webserver or to stay in front of the victim to intercept all of the incoming HTTP requests. This makes it difficult to launch an attack that simulates the dynamics of normal user behaviors, even if the attacker is able to capture a few of the users' request sequences and replay them repeatedly. Therefore, the attacker can be detected by server-side intrusion detection systems using the users' behavior profiles.

Among the existing attack modes, the simplest and most effective form of App-DDoS attack is utilizing HTTP "GET /" to launch attacks by requesting the homepage of the victim website repeatedly. Because such an attack merely needs to know the domain name of the victim website, without specifying the URL of each object, the attacker can launch the offensive easily. Furthermore, the homepage usually is the most frequently accessed webpage and has many embedded objects; thus, attacking through the homepage is effective and not easily detected.

IV. MODEL AND ALGORITHM

A. Basic Ideas for Modeling Web Browsing Behaviors

The browsing behavior of a web user is related to two factors: the structure of the website, which comprises a huge number of web documents and hyperlinks, and the way the user accesses the webpages. Usually, a typical webpage contains a number of links to other embedded objects, which are referred to as in-line objects. A website can be characterized by the hyperlinks among the webpages and the number of in-line objects in each page. When a user clicks a hyperlink pointing to a page, the browser will send out a number of requests for the page and its several in-line objects (Fig. 1). We call this the "HTTP ON" period. Following the "HTTP ON" period is the user's reading time, which is called "HTTP OFF". Then, the user may follow a series of hyperlinks provided by the current webpage to continue her/his access. The alternative way the user accesses the website is to jump from one page to another by typing URLs in the address bar, using the navigation tools, or selecting from the favorites of the browser.

Fig. 1. Browsing behavior.

Besides these two factors, the Internet proxies and caches (including the browser cache) are other important issues affecting the server-side observations of the user behavior. These proxies and caches may respond to the HTTP requests emitted by the browser, and thus some of the user's requests may not arrive at the webserver. This leads to different logged request sequences for the same page when it is accessed by different users, at different times, or through different paths.

Fig. 2. Web browsing model.

Considering the factors described above, we introduce our web browsing model, as shown in Fig. 2. When a user clicks a hyperlink pointing to a page, the browser will send out a number of requests for the page and its in-line objects. These requests may arrive at the webserver in a short time interval, except those responded to by proxies or caches. Which and how many requests of the clicked page arrive at the webserver is not deterministic. Because the in-line objects overlap among pages, the real clicked pages of the user cannot be observed directly on the server side; they can only be estimated from the received request sequence. In this paper, the estimation is based on data clustering. Because the HTTP OFF period is usually much longer than the HTTP ON period, the requested objects in the observation sequence can be grouped into different clusters according to this characteristic. Each group represents a webpage. Then, the user's request sequence (e.g., the object sequence in Fig. 2) is transformed into the corresponding group sequence (e.g., the group sequence in Fig. 2). The order, or transition, of consecutive groups reflects the user's click behavior when browsing from one page to another.

In Sections IV-B and IV-C, we will introduce a new HsMM to describe the dynamic web access behavior and implement the anomaly detection and filtering for App-DDoS attacks.
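As an illustration of the request-grouping step described in this subsection, the sketch below splits a time-ordered request stream wherever the inter-request gap looks like an HTTP OFF period. The gap threshold, field layout, and object names are illustrative assumptions, not values taken from the paper.

# Sketch: group a user's request stream into page-level clusters by
# splitting wherever the inter-request gap looks like an HTTP OFF period.
# The 2.0-second threshold is an illustrative assumption.

GAP_THRESHOLD = 2.0  # seconds; assumed boundary between HTTP ON and HTTP OFF


def group_requests(requests):
    """requests: list of (timestamp, object_id) sorted by timestamp.
    Returns a list of clusters, each cluster holding the object_ids that
    are assumed to belong to one clicked page."""
    clusters = []
    current = []
    prev_ts = None
    for ts, obj in requests:
        if prev_ts is not None and ts - prev_ts > GAP_THRESHOLD:
            clusters.append(current)   # long gap: the previous page is finished
            current = []
        current.append(obj)
        prev_ts = ts
    if current:
        clusters.append(current)
    return clusters


# Example: three bursts separated by reading pauses give three page clusters.
reqs = [(0.0, "a.html"), (0.2, "a.css"), (0.3, "a.png"),
        (8.5, "b.html"), (8.7, "b.png"),
        (20.1, "c.html")]
print(group_requests(reqs))
# [['a.html', 'a.css', 'a.png'], ['b.html', 'b.png'], ['c.html']]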

B. Hidden Semi-Markov Model


The HsMM is an extension of the hidden Markov model (HMM) with explicit state duration. It is a stochastic finite state machine, specified by $(S, \pi, A, P)$, where:

• $S$ is a discrete set of hidden states with cardinality $M$, i.e., $S = \{1, 2, \ldots, M\}$;

• $\pi$ is the probability distribution for the initial state, $\pi_m \equiv \Pr[s_1 = m]$, where $s_t$ denotes the state that the system takes at time $t$ and $m \in S$. The initial state probability distribution satisfies $\sum_m \pi_m = 1$;

• $A$ is the state transition matrix with probabilities $a_{mn} \equiv \Pr[s_t = n \mid s_{t-1} = m]$, $m, n \in S$, and the state transition coefficients satisfy $\sum_n a_{mn} = 1$;

• $P$ is the state duration matrix with probabilities $p_m(d) \equiv \Pr[\tau_t = d \mid s_t = m]$, where $\tau_t$ denotes the remaining (or residual) time of the current state $s_t$, $m \in S$, $d \in \{1, \ldots, D\}$, $D$ is the maximum interval between any two consecutive state transitions, and the state duration coefficients satisfy $\sum_d p_m(d) = 1$.

Then, if the pair process $(s_t, \tau_t)$ takes on the value $(m, d)$, the semi-Markov chain will remain in the current state $m$ until time $t + d - 1$ and transit to another state at time $t + d$. The states themselves are not observable. The accessible information consists of symbols from the alphabet of observations $V = \{v_1, \ldots, v_K\}$, where $o_t$ denotes the observable output at time $t$ and $T$ is the number of samples in the observed sequence. For every state, an output distribution is given as $b_m(v_k) \equiv \Pr[o_t = v_k \mid s_t = m]$, where $v_k \in V$, $m \in S$, and $K$ is the size of the observable output set. The output probabilities satisfy $\sum_k b_m(v_k) = 1$, and the output probability of a segment of observations factorizes as $b_m(o_{t_1:t_2}) = \prod_{t=t_1}^{t_2} b_m(o_t)$ when the "conditional independence" of outputs is assumed, where $o_{t_1:t_2}$ represents the observation sequence from time $t_1$ to time $t_2$. Thus, the set of HsMM parameters consists of the initial state distribution, the state transition probabilities, the output probabilities, and the state duration probabilities. For brevity of notation, they are denoted as $\lambda \equiv \{\pi_m, a_{mn}, b_m(v_k), p_m(d)\}$. The algorithms of the HsMM can be found in [13] and [14].

C. Problem Formulation

We use the Markov state space to describe the webpage set of the victim website. Each underlying semi-Markov state is used to represent a unique webpage clicked by a web user. Thus, the state transition probability matrix represents the hyperlink relations between different webpages. The duration of a state represents the number of HTTP requests received by the webserver when a user clicks the corresponding page. The output symbol sequence of each state throughout its duration represents those requests of the clicked page which pass through all proxies/caches and finally arrive at the webserver. We use a simple example, based on Fig. 2, to explain these relations. The unseen clicked page sequence is (page 1, page 2, page 3). Except for the requests responded to by the proxies or caches, the corresponding HTTP request sequence received by the webserver is the object sequence shown in Fig. 2. When the observed request sequence is input to the HsMM, the algorithm may group the requests into three clusters and denote them as a state sequence. The state transition probability $a_{12}$ represents the probability that page 2 may be accessed by the user after accessing the current page 1. The duration of the first state is 4, which means that 4 HTTP requests of page 1 arrived at the webserver.

Fig. 3. State transitions.

The web browsing behavior can be formulated in terms of the HsMM as follows. We use a random vector $o_t = (r_t, \Delta t_t)$ to denote the observation at the $t$th request in the request sequence, where $r_t \in \{1, \ldots, K\}$ is the index of the object that the $t$th request intends to get from the original web server, such as an HTML page or an embedded image, and $\Delta t_t$ is the interval between the $t$th and $(t-1)$th requests. $K$ is the size of the accessible object set. The $o_t$ form the observation vector sequence $o_{1:T}$, where $T$ is the total length of the observation sequence. We use an underlying semi-Markov process $s_t$ to describe the user's click behavior, where $s_t \in S$ is the index of the webpage browsed by the user at the $t$th request. $S$ is the Markov state space, which represents the webpage set of the website. We use $M$ to denote the size of $S$. A transition from $s_{t-1}$ to $s_t$ is considered as the user's browsing from one webpage to another, following a hyperlink between the two pages. The state transition probability $a_{mn}$ is the probability of the user browsing from page $m$ to page $n$. If $s_t = s_{t-1}$, then we assume the $t$th request and the $(t-1)$th request are made for the same page.

Considering that the user may jump from one page to another by typing URLs or opening multiple windows, without following the hyperlinks on the webpages of the website, we introduce a Null page, denoted Null, to describe these browsing behaviors. That is, if page $m$ has no direct link to page $n$ (i.e., the transition probability $a_{mn} = 0$), and the user jumps from $m$ to $n$ directly while browsing the website, we assume that the user first transits from page $m$ to the Null page and then transits from the Null page to page $n$, with the transition probabilities $a_{m,\mathrm{Null}}$ and $a_{\mathrm{Null},n}$, respectively, as shown in Fig. 3. Correspondingly, the state space is extended to $S \cup \{\mathrm{Null}\}$. In addition, we use another random variable $\tau_t$ to denote the remaining (or residual) number of requests of the current webpage $s_t$. Then the browsing behaviors of web users can be described by the HsMM, and the state sequence and the model parameters can be estimated from the observed sequences of the web users' requests.
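To make the Null-page construction concrete, the sketch below stores a transition matrix over the extended state space and routes a jump between unlinked pages through the Null state; the sizes, the probabilities, and the convention that the last index is the Null page are assumptions for illustration only.

import numpy as np

# Sketch of the extended state space S u {Null}. The last row/column of A
# plays the role of the Null page; the numbers are illustrative only.

M = 3                      # ordinary webpages 0..2
NULL = M                   # index of the Null page
A = np.array([
    # to: 0    1    2    Null
    [0.0, 0.6, 0.0, 0.4],  # from page 0 (no hyperlink 0 -> 2)
    [0.5, 0.0, 0.3, 0.2],  # from page 1
    [0.2, 0.4, 0.0, 0.4],  # from page 2
    [0.4, 0.3, 0.3, 0.0],  # from Null: typing a URL, bookmarks, etc.
])

def transition_prob(m, n):
    """Probability of browsing from page m to page n. A direct jump
    without a hyperlink (a_mn == 0) is modelled as m -> Null -> n,
    as in Fig. 3."""
    if A[m, n] > 0.0:
        return A[m, n]
    return A[m, NULL] * A[NULL, n]

print(transition_prob(0, 1))   # direct hyperlink
print(transition_prob(0, 2))   # no hyperlink: routed through the Null page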

In order to handle the newly introduced Null state and to reduce the computations, we extend the re-estimation algorithm proposed in [13] for web browsing behavior. The improved algorithm is outlined as follows.

We first define the forward and backward variables and then, according to the state transition diagram shown in Fig. 3, derive the forward formulas (1)–(3) and the backward formulas (4)–(7), in which the Null state between two consecutive pages is handled explicitly. Using the forward and backward variables, we define several auxiliary variables in (8)–(13), where the Null state is again treated separately.

Now we derive the estimation formulas based on these variables. Considering that the observation variables $r_t$ and $\Delta t_t$ in $o_t$ are independent of each other, the output probability factorizes as $b_m(o_t) = \Pr[r_t \mid s_t = m]\,\Pr[\Delta t_t \mid s_t = m]$. Furthermore, because our anomaly detection method will have multiple observation sequences from multiple user behaviors, we derive the re-estimation algorithm of the HsMM for multiple observation sequences by frequency in this paper. Assuming there are $L$ observation sequences with different lengths $T_1, \ldots, T_L$, we obtain the extended re-estimation formulas (14)–(20), where the variables whose superscript is $(l)$ belong to the $l$th user, and the remaining quantities are normalization factors chosen such that the re-estimated initial, transition, output, and duration probabilities each sum to one.

We define the average logarithmic likelihood per observed symbol, $\frac{1}{T}\log \Pr[o_{1:T} \mid \lambda]$, as the measure of normality of the observation sequence, and call it "entropy" in the rest of this paper.
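The paper's extended formulas (1)–(20) are specific to the Null-state model and are not reproduced here. As a rough picture of how the entropy can be obtained from a forward pass alone, the sketch below implements a standard residual-time forward recursion for an explicit-duration HMM (in the spirit of [13]) and returns the average log-likelihood per observed symbol; per-step rescaling is used to avoid numerical underflow. It is a simplified illustration, not the paper's exact algorithm.

import numpy as np

def hsmm_forward_entropy(pi, A, P, B, obs):
    """Scaled residual-time forward pass for an explicit-duration HMM.

    pi : (M,)   initial state probabilities
    A  : (M, M) state transition probabilities (no self-transitions)
    P  : (M, D) duration probabilities, P[m, d-1] = p_m(d)
    B  : (M, K) emission probabilities, B[m, k] = b_m(object k)
    obs: sequence of object indices (one per HTTP request)

    Returns the average log-likelihood per observed symbol, the quantity
    used as the "entropy" of a request sequence.
    """
    M, D = P.shape
    log_lik = 0.0

    # alpha[m, d-1] = P(o_1..o_t, state m, residual duration d), rescaled per step
    alpha = pi[:, None] * P * B[:, obs[0]][:, None]
    scale = alpha.sum()
    alpha /= scale
    log_lik += np.log(scale)

    for o in obs[1:]:
        ended = alpha[:, 0]                 # states whose duration just ended
        stay = np.zeros_like(alpha)
        stay[:, :-1] = alpha[:, 1:]         # same state, residual duration shrinks by one
        enter = (ended @ A)[:, None] * P    # enter a new state and draw a new duration
        alpha = (stay + enter) * B[:, o][:, None]
        scale = alpha.sum()
        alpha /= scale
        log_lik += np.log(scale)

    return log_lik / len(obs)


# Toy example: 2 pages, 3 objects, maximum duration 2 (all values illustrative).
pi = np.array([0.6, 0.4])
A  = np.array([[0.0, 1.0], [1.0, 0.0]])
P  = np.array([[0.7, 0.3], [0.5, 0.5]])
B  = np.array([[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]])
print(hsmm_forward_entropy(pi, A, P, B, [0, 0, 2, 2, 1]))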

V. ANOMALY DETECTION

A. Normality Detection and Filter Policy

Using the above re-estimation algorithm, we establish the new HsMM to describe normal web users' browsing behaviors by training the model on a set of request sequences made by a large number of normal users. We define the deviation from the mean entropy of the training data as the abnormality of an observed request sequence made by a user. The smaller the deviation, the higher the normality of the observation sequence.

Fig. 4. Filter based on behavior model.

Fig. 4 shows the detector and the filter that are based on the behavior model. The filter, residing between the Internet and the victim, takes in an HTTP request and decides whether to accept or reject (drop) it. If a request is accepted, it can pass through the filter and reach the victim. In Section VI, we will use the false negative ratio (FNR) to evaluate the performance of the filter. We designed a self-adaptive algorithm for on-line update of the HsMM in our previous work [26]. Due to the constraint on computing power, the detector/filter is unable to adapt its policy rapidly. Because web access behavior is short-term stable [22], the filter policy is fixed for a short period of time, which will be called a slot in the rest of this paper. We also fix the length of the request sequence used for anomaly detection. For a given HTTP request sequence of the $l$th user, we compute the average entropy of the user's request sequences over this period. Then we calculate the deviation of the average entropy from the mean entropy of the model. If the deviation is larger than a predefined threshold, the user is regarded as abnormal, and the request sequence will be discarded by the filter when resources (e.g., memory and bandwidth) are scarce. Otherwise, the user's HTTP requests can pass through the filter and arrive at the victim smoothly. Then, when the given slot times out, the model implements the on-line update by the self-adaptive algorithm proposed in [26]. This enables the model to run automatically over the long term. We will discuss this issue in detail in Section VII.

B. On-Line Algorithm for the Computation of Normality

The total number of states $M$ is very large, which requires a lot of computation for anomaly detection. To implement the algorithm online, we introduce a new algorithm based on the M-algorithm [15] to solve this issue.

The M-algorithm is widely adopted in decoding for digital communications because it requires far fewer computations than the Viterbi algorithm. The aim of the M-algorithm is to find a path with distortion or likelihood metrics as good as possible (i.e., to minimize the distortion criterion between the symbols associated with the path and the input sequence). However, contrary to the Viterbi algorithm, only the $N$ best paths are retained at every instant in the M-algorithm. It works as follows. At time $t$, only the $N$ best paths are retained for extension. Associated with each path is a value called the path metric, which is the accumulation of transition metrics and acts as a distortion measure of the path. The transition metric is the distortion introduced by the distance between the symbols associated with a trellis transition and the input symbol. The path metric is the criterion used to select the $N$ best paths. The M-algorithm then moves forward to the next time instant by extending the $N$ retained paths to generate new paths. All the terminal branches are compared to the input data corresponding to this depth, and the poorest paths are deleted. This process is repeated until the whole input sequence has been processed.

We apply the M-algorithm to our on-line anomaly detection based on the following considerations. (i) We mentioned in Section III that only a small number of pages are referenced frequently. Since we use the Markov states to describe the clicked webpages, this means that many rarely visited states (or, say, pages) are redundant. (ii) Based on (20), the entropy of an observation sequence can be computed from the forward variables alone. Thus, we only need to optimize the forward process of the HsMM, which is very suitable for using the M-algorithm to reduce the computations.

Hence, in this paper, we limit the number of states that survive in computing the entropy of the browsing path of a user. We define $\Phi_t$ as the set of states surviving in the $t$th forward step and $\Psi_t$ as the set of states that can be reached directly from one of the states in $\Phi_t$. Thus, $\Phi_{t+1}$ is selected from $\Psi_t$ according to the M-algorithm. The M-algorithm for computing the entropy of the $l$th user's sequence is as follows.

Step 1. Select an initial subset $\Phi_1$ containing the $N$ states whose initial forward values are the largest, where $N$ is a given number, and let $t = 1$.

Step 2. Compute the forward variables for all states in $\Psi_t$ by (1)–(3).

Step 3. Select from $\Psi_t$ the subset $\Phi_{t+1}$ of the $N$ states whose forward values are the largest, and determine the corresponding $\Psi_{t+1}$.

Step 4. Set $t = t + 1$, and repeat Step 2 and Step 3 until $t = T_l$, where $T_l$ is the length of the $l$th user's sequence.

Step 5. Use (20) to compute the entropy of the user's observation sequence.

Figs. 5 and 6 show the pseudocode of the proposed scheme, which includes model training and anomaly detection.

Fig. 5. Pseudocode for the extended HsMM training.

Fig. 6. Pseudocode for anomaly detection.
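A minimal sketch of the pruning idea behind Steps 1–5 is given below: after every forward step only the $N$ states carrying the largest forward mass are kept, so later steps only examine their direct successors. Taking the total forward mass of a state as its path metric is an assumption about how the selection could be realized, not the paper's exact formulation; the matrix layout follows the earlier sketch.

import numpy as np

def pruned_forward_entropy(pi, A, P, B, obs, n_keep):
    """Forward pass that keeps only the n_keep best states per step
    (M-algorithm style pruning). The "path metric" of a state is taken
    to be its total forward mass, an illustrative assumption."""
    log_lik = 0.0

    alpha = pi[:, None] * P * B[:, obs[0]][:, None]
    alpha = _prune(alpha, n_keep)
    scale = alpha.sum()
    alpha /= scale
    log_lik += np.log(scale)

    for o in obs[1:]:
        ended = alpha[:, 0]
        stay = np.zeros_like(alpha)
        stay[:, :-1] = alpha[:, 1:]
        enter = (ended @ A)[:, None] * P   # pruned states contribute nothing here
        alpha = (stay + enter) * B[:, o][:, None]
        alpha = _prune(alpha, n_keep)
        scale = alpha.sum()
        alpha /= scale
        log_lik += np.log(scale)

    return log_lik / len(obs)


def _prune(alpha, n_keep):
    """Zero out every state except the n_keep states with the largest mass."""
    mass = alpha.sum(axis=1)
    if np.count_nonzero(mass) <= n_keep:
        return alpha
    keep = np.argsort(mass)[-n_keep:]
    pruned = np.zeros_like(alpha)
    pruned[keep] = alpha[keep]
    return pruned

With n_keep equal to the number of states this reduces to the full forward pass of the earlier sketch; smaller values trade a small approximation error for far less work per request, which is the trade-off examined in Section VI-D.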

C. Computational Complexity

In our experiment, the maximum state duration $D$ is 15. During the experiments, we found that most elements of the model parameters, i.e., $a_{mn}$, $b_m(v_k)$, and $p_m(d)$, are zeros. So we can apply sparse matrices to express these variables. This method saves memory and improves the computational efficiency of the practical implementation. In the anomaly detection phase, the most important factor is the computational complexity of computing the user's entropy. By using the new algorithm introduced in the last subsection, only a subset of the forward variables needs to be computed. Let $N'$ be the number of states in $\Psi_t$, i.e., the number of states that can be reached directly from one of the states in $\Phi_t$. Then the computational complexity of the algorithm grows with $N'$ rather than with the total number of states $M$. Because each page usually contains a limited number of hyperlinks to other pages, say 20, we usually have $N' \ll M$. Besides, memory units are needed only for the model parameters and for the forward variables; we do not need to store the observations. Therefore, we can compute the entropies of all observation sequences efficiently and implement the on-line anomaly detection.
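One way to exploit this sparsity is a dictionary-based layout that stores only the nonzero successors and objects of each page, so that memory grows with the number of hyperlinks and embedded objects rather than with the full matrix sizes. The structure and page names below are illustrative assumptions, not the paper's implementation.

# Sketch: dictionary-based sparse storage of the mostly-zero parameters.

sparse_A = {                      # page -> {reachable page: probability}
    "index.html": {"news.html": 0.6, "about.html": 0.4},
    "news.html":  {"index.html": 1.0},
}
sparse_B = {                      # page -> {embedded object: probability}
    "index.html": {"index.html": 0.5, "logo.png": 0.3, "main.css": 0.2},
    "news.html":  {"news.html": 0.7, "photo.jpg": 0.3},
}

def successors(page):
    """States reachable in one click from `page` (the set called Psi in the text)."""
    return sparse_A.get(page, {}).keys()

def emit_prob(page, obj):
    """b_page(obj); zero for objects that never belong to the page."""
    return sparse_B.get(page, {}).get(obj, 0.0)

print(list(successors("index.html")), emit_prob("news.html", "photo.jpg"))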
VI. EXPERIMENT RESULTS AND ANALYSIS

In this section, we used the busiest 6-hour trace from an educational website to validate our anomaly detection algorithms. Among the selected data, we randomly used two thirds of the user request sequences (about 4700 sequences) within the first two hours to implement the HsMM training. The remaining data are used for testing. The model was run on a computer with an Intel Pentium IV 2.80 GHz CPU and 1 Gbyte of RAM.

A. Initialization of Model Parameters

In our HsMM, a state denotes a page, transitions among states denote hyperlinks among different pages, and the emissions of a state are the objects forming the corresponding page. Therefore, the state transition matrix is determined by the website's hyperlink structure, and the emission/observation matrix is determined by the objects embedded in the pages. In the model training phase, we only needed to determine the state transition probabilities, which describe how frequently the users browse from one page to another, and the emission/observation probabilities for each state, which describe how often the requests for the objects forming a page arrive at the original webserver. Hereby, we extract the website's basic information, e.g., the hyperlink structure and each page's objects, to initialize the model.

Errors exist for dynamic pages whose hyperlinks or in-line objects are dynamically produced and determined by scripts (e.g., JavaScript or VBScript), because we cannot explore all possible transitions among dynamic pages. Therefore, we limit a state to denote a static page and use a macro state to represent all dynamic pages.

B. Characteristics of Web Browsing Behaviors

Fig. 7. Popularity of pages.

We first computed some statistics of the collected traffic data. As shown in Fig. 7, about 100 pages (states) draw 98% of the web access on the website. This result is similar to most previous works on web mining. Therefore, the most frequently appearing states are limited to about 100. This implies that the M-algorithm is suitable for reducing the computational amount in online anomaly detection. We then used the HsMM to estimate the state sequences of users from the collected data. Fig. 8 shows the distribution of the number of states (or pages) experienced by each user while browsing the website. There are several peaks in the plot, and the biggest one is at about 30 pages. Most request sequences in our experimental data have lengths within [20, 40].

Fig. 8. Distribution of sequence length. (a) Distribution of page sequence length. (b) Distribution of object sequence length.

Fig. 9. Histogram of state duration.

This implies that the browsing behaviors of the users can be classified into five groups and that the users in each group have similar browsing behavior. Fig. 9 shows the distribution of the browsing time for each page. It resembles a Pareto distribution.

C. App-DDoS Attack Detection

We simulate the App-DDoS attack based on [24] and [25]. We assume the attack coverage is about 0.025, which is defined as the ratio of attack clients to all clients. Each node attacks the webserver by sending GET requests with random objects. We insert the App-DDoS attack requests into the normal traffic shown in Fig. 10(a). As shown in Fig. 10(b), the App-DDoS attacks persist for about two hours, from 8725 s to 16250 s. In order to generate a stealthy attack which is not easily detected by the traditional methods, we shape each attack node's output traffic to approximate the average request rate of a normal user.

Fig. 10. Arrival rate versus time. (a) Arrival rate versus time of traffic without attack. (b) Arrival rate versus time of traffic with attack.

Fig. 11. Arrival rate of normal and abnormal traffic. (a) Distribution of HTTP arrival rate. (b) QQplot of arrival rate distribution.

The App-DDoS attack is aggregated from the low-rate malicious traffic as shown in Fig. 10(b); the histogram of the arrival rate distribution of normal users and attack clients is shown in Fig. 11(a), and the QQplot of the histogram is shown in Fig. 11(b). Obviously, the statistics of these characteristics are similar for normal clients and attackers. Thus, frequency analysis and average arrival rate analysis are not suitable for distinguishing the low-rate attackers from the normal users. We then apply the proposed scheme to the detection of App-DDoS attacks. Fig. 12 shows the histogram of the entropies of the normal users (curve a) and the abnormal users (curve b). There are significant differences in the entropy distributions between these two groups: the entropy for most of the normal users is larger than −5, but for the attack nodes it is less than −5 and lies mainly between −8 and −5. Therefore, the model can distinguish the attackers from the normal users by their entropies.

Fig. 12. Entropy distribution.

TABLE I. STATISTICAL RESULTS OF ENTROPY DISTRIBUTION

The statistical results of the entropy of the normal training data and the emulated App-DDoS attacks are given in Table I. Fig. 13 shows that if we take −6.0 as the threshold value of normal web traffic's average entropy, the false negative ratio (FNR) is about 1% and the detection rate is about 98%. Fig. 14 shows the receiver operating characteristic (ROC) curves for the performance of our anomaly detection algorithm on App-DDoS attacks.

Fig. 13. Cumulative distribution of average entropies.

Fig. 14. ROC curve.

In our scheme, a user's normality depends on the entropy of his/her HTTP sequence with respect to the model. Thus, we need a threshold on the sequence's length to decide whether the sequence is normal or not. We call this threshold the "decision length", which involves two issues: (i) the number of requests in the sequence, the "sequence decision length"; (ii) the duration of the HTTP sequence, the "time decision length". The decision length may affect the real-time response and the precision of our detection system. The shorter the decision length, the better the real-time response, but the lower the reliability. Fig. 15 shows the entropies varying with the two types of decision length. Fig. 15(a) and (c) present the entropy versus sequence length for normal users and attack clients, respectively. We can see that the entropies converge as the sequence length increases. When the length is larger than 150, the normal users' entropies concentrate around −4 to −2, while those of the attack sequences concentrate around −8 to −6. Thus, the best sequence decision length for this trace is about 150. Fig. 15(b) and (d) present the entropy versus time for normal users and attack clients. Similarly, when the duration of the continuous HTTP request sequence is larger than 100 seconds, the entropies of normal users concentrate around −4 to −2, while the attack sequences' entropies concentrate around −8 to −6. Hence, the best response time for detecting an abnormal user is about 100 seconds.

D. Efficiency of the Detection Algorithm

In the experiment, there are 1493 pages (i.e., $M = 1493$) and 4130 accessed objects (i.e., $K = 4130$). We assume $D$ is 15 and vary $N$ for the M-algorithm from 25 to 1493 in increments of 25. As shown in Fig. 16, when $N$ is 500 (about 1/3 of the total states), the average entropy is about −2.803, and the difference in the average entropy between the classical algorithm and the new algorithm is very small (about 0.002). This shows that we can use the new algorithm with a small $N$ to compute the user's entropy, which reduces the computational complexity and memory requirement significantly in the on-line implementation. As for the training phase, Fig. 17 shows that the algorithm converges after about 10 iterations. The timer shows that the execution time of model parameter estimation is less than 20 minutes in our experiment.

VII. DISCUSSION

Several issues related to the proposed scheme are discussed in this section.

A. Expansion of the Model

In our experiment, we only consider attacks whose request rate and sequence length are similar to those of normal users. For other special attacks, e.g., attack sequences shorter than 150 requests and/or vicious sequences arranged to last no more than 100 seconds, the above scheme should be extended. We show the main idea based on the experiment results. Fig. 8 shows that the length distribution of requested page sequences has several peaks in the histogram and that the biggest one is located at about 30 pages, which implies that the browsing behaviors of the users can be classified into five groups. Thus, in this case, in order to obtain more accurate detection results, five HsMMs are required. The detection is then implemented according to Fig. 18. In this way, incoming sequences of any length (including short attacks issuing only 10 requests for pages) can also be detected. The expansion method includes three aspects, as follows.

First, we cluster the web users' sequences (i.e., the training data) into groups $G_1, \ldots, G_5$ by analyzing the distribution of their request sequence lengths. The groups are indexed by their length distribution, i.e., $G_1$ denotes the group in which the average length of the users' request sequences is the shortest, $G_2$ denotes the group with the second shortest average length of request sequences, and so on; $G_5$ is the last group, with the longest average length of request sequences.

Second, a model $\lambda_i$ is trained on the group $G_i$ and is used to describe the access behavior characteristics of $G_i$. Then, an incoming request sequence of increasing length is examined by $\lambda_1$, then $\lambda_2$, $\lambda_3$, and so forth, as shown in Fig. 18. At any stage, the system can issue an alarm for a suspicious client if the deviation of her/his access behavior from the model is beyond the acceptable threshold.

Fig. 15. Varieties of users’ likelihood. (a) Entropy versus index of request for normal users. (b) Entropy versus time for normal users. (c) Entropy versus index of
request for attack sequences. (d) Entropy versus time for attack sequences.

Fig. 16. Performance of the new algorithm. (a) log(likelihood) for different numbers of retained states. (b) log(likelihood) difference between the new algorithm and the classical algorithm for different numbers of retained states.

Fig. 17. Convergence of model training.

B. About the Model

The HsMM is applied to describe the web user behavior. It has been successfully applied in many fields, such as machine recognition, sequence clustering, mobility tracking in wireless networks, and inference for structured video sequences. The most notable characteristic of the HsMM is that it can be used to describe most physical signals without many hypotheses. Furthermore, the non-stationary and non-Markovian properties of the HsMM can well describe the self-similarity or long-range dependence of network traffic that has been proved by vast observations on the Internet [14], which greatly benefits our study. On the other hand, the structure of the HsMM is also very suitable for describing web browsing behavior, e.g., the relation between a webpage and its in-line objects and the effect caused by the intermediate proxies and caches. Last, as we discussed in Section V, based on the forward-backward algorithm and the proposed M-algorithm for the HsMM, both the computational complexity and the memory requirement of the improved model are lower than those of data mining, so it can satisfy the requirements of on-line detection.

Another aspect of the model is how to ensure the normality of the training data and keep the model updated with changes in the normal users' behavior. Our solution is based on unsupervised learning theory [22] and the dynamic algorithm of the HsMM [26], i.e., we use the supervised method in building the initial model and the unsupervised method in collecting new training data for updating the model. In this way, the current model is used to select the training data set for deriving the next model, and then, based on the dynamic algorithm of the HsMM proposed in our previous work [26], the model can carry out the on-line parameter update and trace the latest characteristics of the normal users' browsing behaviors. The update can be implemented over a relatively long time period, say ten minutes, because network behavior can be considered stationary in that interval, as reported in the literature.

C. Deviation of Model

Central to our scheme is the idea of being able to detect "outliers", or abnormal browsing behavior, by computing the entropy of the HTTP sequence against the model. If the threshold used to mark an entropy as abnormal is not properly set, too many false alarms will be issued. Here, we present a method to determine a proper threshold of entropy for anomaly detection.

In the experiments of this paper, the DR is 98% and the FPR is 1.5% when the threshold of entropy is set to −6.0. However, when the system is put into real application, we cannot preset this threshold. What we know is just the entropy distribution of the normal request sequences and the FPR. We therefore suggest a general method to determine the threshold.

Fig. 18. Flow chart of detection based on hierarchical multi-model scheme.

TABLE II. GAUSSIAN DISTRIBUTIONS' DETECTION THRESHOLD

The method is based on the Central Limit Theorem. That is, given a distribution with a mean and variance, the sampling distribution approaches a Gaussian distribution. Thus, we can describe the entropy distribution of the training data using the following Gaussian distribution:

$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$

where $\mu$ is the mean and $\sigma^2$ is the variance. Considering that the $3\sigma$ error level gives a confidence interval of 99.7%, it should be good enough even for high-precision detection scenarios. Table II lists the detection threshold settings and their corresponding FPR and DR in our experiments (the mean and variance have been given in Section VI). As indicated in this table, the detection level can safely be selected as $\mu - 3\sigma$, and this choice ensures an FPR smaller than 2.0% and a DR larger than 97%.
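A minimal sketch of this threshold rule: fit a Gaussian to the entropies of the normal training sequences and place the detection threshold at mu - 3*sigma, the 99.7% level discussed above, flagging sequences whose entropy falls below it (assuming, as in the experiments above, that attack sequences have the smaller entropies). The training entropies used here are placeholder data, not the paper's measurements.

import numpy as np

# Placeholder training entropies; a real deployment would use the entropies
# of the normal request sequences computed against the trained HsMM.
training_entropies = np.random.default_rng(1).normal(-2.8, 0.9, size=5000)

mu = training_entropies.mean()
sigma = training_entropies.std()
threshold = mu - 3.0 * sigma          # the mu - 3*sigma detection level

def classify(entropy, threshold=threshold):
    """A sequence whose entropy falls below the threshold is flagged as abnormal."""
    return "abnormal" if entropy < threshold else "normal"

print(round(threshold, 2), classify(-7.5), classify(-2.1))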
VIII. CONCLUSIONS AND FUTURE WORK

In this paper, we focused on the detection of App-DDoS attacks and presented a new model to describe the browsing behavior of web users based on a large-scale hidden semi-Markov model. A new on-line algorithm based on the M-algorithm was designed for the anomaly detection. A set of real traffic data collected from an educational website and generated App-DDoS attack traffic were used to validate our model. The experimental results showed that when the detection threshold of entropy is set to −6.0, the DR is 98% and the FPR is 1.5%, which demonstrates that the model can be used effectively to describe the browsing behavior of normal web users and to detect App-DDoS attacks. We also showed that the new algorithm can reduce the memory requirement and improve the computational efficiency.

Several issues will need further research: 1) coping with attacks launched through dynamic webpages (e.g., scripts); 2) detecting App-DDoS attacks that mimic the behavior of a proxy server instead of individual web surfers; and 3) applying the proposed scheme to detect other application-layer-based attacks, such as FTP attacks.

ACKNOWLEDGMENT

The authors would like to thank the anonymous referees for their helpful comments and suggestions to improve this work. They also thank the Network Center of Sun Yat-Sen University for their help with the access logs.

REFERENCES

[1] C. Douligeris and A. Mitrokotsa, "DDoS attacks and defense mechanisms: Classification and state-of-the-art," Computer Networks: The Int. J. Computer and Telecommunications Networking, vol. 44, no. 5, pp. 643–666, Apr. 2004.
[2] J. Mirkovic, G. Prier, and P. L. Reiher, "Attacking DDoS at the source," in Proc. 10th IEEE Int. Conf. Network Protocols, Sep. 2002, pp. 312–321.
[3] C. Jin, H. Wang, and K. G. Shin, "Hop-count filtering: An effective defense against spoofed traffic," in Proc. ACM Conf. Computer and Communications Security, 2003, pp. 30–41.
[4] T. Peng, K. Ramamohanarao, and C. Leckie, "Protection from distributed denial of service attacks using history-based IP filtering," in Proc. IEEE Int. Conf. Communications, May 2003, vol. 1, pp. 482–486.
[5] J. B. D. Cabrera et al., "Proactive detection of distributed denial of service attacks using MIB traffic variables: a feasibility study," in Proc. IEEE/IFIP Int. Symp. Integrated Network Management, May 2001, pp. 609–622.
[6] L. Limwiwatkul and A. Rungsawang, "Distributed denial of service detection using TCP/IP header and traffic measurement analysis," in Proc. Int. Symp. Communications and Information Technologies (ISCIT 2004), Sapporo, Japan, Oct. 2004.
[7] S. Noh, C. Lee, K. Choi, and G. Jung, "Detecting distributed denial of service (DDoS) attacks through inductive learning," Lecture Notes in Computer Science, vol. 2690, pp. 286–295, 2003.
[8] R. Basu, R. K. Cunningham, S. E. Webster, and R. P. Lippmann, "Detecting low-profile probes and novel denial of service attacks," in Proc. 2001 IEEE Workshop on Information Assurance and Security, Jun. 2001, pp. 5–10.
[9] S. Ranjan, R. Swaminathan, M. Uysal, and E. Knightly, "DDoS-resilient scheduling to counter application layer attacks under imperfect detection," in Proc. IEEE INFOCOM, Apr. 2006 [Online]. Available: http://www-ece.rice.edu/~networks/papers/dos-sched.pdf
[10] S. Kandula, D. Katabi, M. Jacob, and A. W. Berger, "Botz-4-Sale: Surviving organized DDoS attacks that mimic flash crowds," Mass. Inst. Technol., Tech. Rep. TR-969, 2004 [Online]. Available: http://www.usenix.org/events/nsdi05/tech/kandula/kandula.pdf
[11] R. K. C. Chang, "Defending against flooding-based distributed denial-of-service attacks: A tutorial," IEEE Commun. Mag., pp. 43–51, Oct. 2002.
[12] MyDoom virus. [Online]. Available: http://www.us-cert.gov/cas/techalerts/TA04-028A.html
[13] S.-Z. Yu and H. Kobayashi, "An efficient forward-backward algorithm for an explicit-duration hidden Markov model," IEEE Signal Process. Lett., vol. 10, no. 1, pp. 11–14, Jan. 2003.

[14] S.-Z. Yu, Z. Liu, M. Squillante, C. Xia, and L. Zhang, "A hidden semi-Markov model for web workload self-similarity," in Proc. 21st IEEE Int. Performance, Computing, and Communications Conf. (IPCCC 2002), Phoenix, AZ, Apr. 2002, pp. 65–72.
[15] J. B. Anderson and S. Mohan, "Sequential coding algorithms: A survey and cost analysis," IEEE Trans. Commun., vol. COM-32, pp. 169–176, Feb. 1984.
[16] G. Mori and J. Malik, "Recognizing objects in adversarial clutter: Breaking a visual CAPTCHA," in Proc. IEEE Computer Society Conf. Computer Vision and Pattern Recognition, Jun. 2003, vol. 1, pp. 134–141.
[17] J. Jung, B. Krishnamurthy, and M. Rabinovich, "Flash crowds and denial of service attacks: Characterization and implications for CDNs and web sites," in Proc. 11th IEEE Int. World Wide Web Conf., May 2002, pp. 252–262.
[18] S. Bürklen et al., "User centric walk: An integrated approach for modeling the browsing behavior of users on the web," in Proc. 38th Annu. Simulation Symp. (ANSS'05), Apr. 2005, pp. 149–159.
[19] J. Velásquez, H. Yasuda, and T. Aoki, "Combining the web content and usage mining to understand the visitor behavior in a web site," in Proc. 3rd IEEE Int. Conf. Data Mining (ICDM'03), Nov. 2003, pp. 669–672.
[20] D. Dhyani, S. S. Bhowmick, and W.-K. Ng, "Modelling and predicting web page accesses using Markov processes," in Proc. 14th Int. Workshop on Database and Expert Systems Applications (DEXA'03), 2003, pp. 332–336.
[21] X. D. Hoang, J. Hu, and P. Bertok, "A multi-layer model for anomaly intrusion detection using program sequences of system calls," in Proc. 11th IEEE Int. Conf. Networks, Oct. 2003, pp. 531–536.
[22] M. Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms. New York: IEEE Press, 2002.
[23] J. Cao, W. S. Cleveland, Y. Gao, K. Jeffay, F. D. Smith, and M. Weigle, "Stochastic models for generating synthetic HTTP source traffic," in Proc. IEEE INFOCOM 2004, vol. 3, pp. 1546–1557.
[24] A. Sarika, A. Saumya, and G. Bryon, "DDoS attack simulation, monitoring, and analysis," CS 590D: Security Topics in Networking and Distributed Systems Final Project Report, Purdue University, West Lafayette, IN, Apr. 29, 2004. [Online]. Available: http://www.cs.purdue.edu/homes/bgloden/DDoS_Attack_Simulation.pdf
[25] K. Jiejun et al., "Random flow network modeling and simulations for DDoS attack mitigation," in Proc. IEEE Int. Conf. Communications (ICC '03), May 2003, vol. 1, pp. 487–491.
[26] Y. Xie and S.-Z. Yu, "A dynamic anomaly detection model for web user behavior based on HsMM," in Proc. 10th Int. Conf. Computer Supported Cooperative Work in Design (CSCWD 2006), Nanjing, China, May 2006, vol. 2, pp. 811–816.

Yi Xie received the B.S. and M.S. degrees from Sun Yat-Sen University, Guangzhou, China. From July 1996 to July 2004, he was with the Civil Aviation Administration of China (CAAC). He returned to academia in 2004 and is currently working toward the Ph.D. degree at Sun Yat-Sen University. He was a visiting scholar at George Mason University from 2007 to 2008. His research interests are in network security and web user behavior modeling algorithms.

Shun-Zheng Yu (M'08) received the B.Sc. degree from Peking University, Beijing, China, and the M.Sc. and Ph.D. degrees from the Beijing University of Posts and Telecommunications, China. He was a visiting scholar at Princeton University and the IBM Thomas J. Watson Research Center from 1999 to 2002. He is currently a Professor at the School of Information Science and Technology, Sun Yat-Sen University, Guangzhou, China. His recent research interests include networking, traffic modeling, anomaly detection, and algorithms for hidden semi-Markov models.
