Abstract—Many methods designed to create defenses against distributed denial of service (DDoS) attacks are focused on the IP and TCP layers instead of the high layer. They are not suitable for handling the new type of attack which is based on the application layer. In this paper, we introduce a new scheme to achieve early attack detection and filtering for the application-layer-based DDoS attack. An extended hidden semi-Markov model is proposed to describe the browsing behaviors of web surfers. In order to reduce the amount of computation introduced by the model’s large state space, a novel forward algorithm is derived for the online implementation of the model based on the M-algorithm. Entropy of the user’s HTTP request sequence fitting to the model is used as a criterion to measure the user’s normality. Finally, experiments are conducted to validate our model and algorithm.

Index Terms—Anomaly detection, browsing behaviors, DDoS, hidden semi-Markov Model, M-algorithm.

I. INTRODUCTION

Distributed denial of service (DDoS) attacks constitute one of the major threats and are among the hardest security problems facing today’s Internet [1]. Because of the seriousness of this problem, many defense mechanisms, based on statistical approaches [2]–[11], have been proposed to combat these attacks.

In statistical approaches to defend against DDoS attacks, the statistics of packet attributes in the headers of IP packets, such as IP address, time-to-live (TTL), protocol type, etc., are measured, and the packets deemed most likely to be attack packets are dropped based on these measurements. Such approaches often assume that there are some traffic characteristics that inherently distinguish the normal packets from the attack ones. Therefore, “abnormal” traffic can be detected based on those traffic characteristics during a DDoS attack.

Although the approaches based on statistical attributes of the TCP or IP layers are valuable for certain DDoS attacks (e.g., SYN/ACK flooding), they are not always workable for some special DDoS attacks which work on the application layer. This was witnessed on the Internet in 2004, when a worm virus named “Mydoom” [12] used pseudo HTTP requests to attack victim servers by simulating the behavior of browsers. Because the DDoS defense mechanisms that are based on the statistical attributes of lower layer information could not distinguish the abnormal HTTP requests from the normal ones, the victim’s server soon collapsed. In this paper, we call such application-layer DDoS attacks “App-DDoS” attacks.

The challenge of detecting App-DDoS attacks is due to the following aspects. (i) The App-DDoS attacks utilize high layer protocols to pass through most of the current anomaly detection systems designed for low layers and arrive at the victim webserver. (ii) App-DDoS attacks usually depend on successful TCP connections, which makes the general defense schemes based on detection of spoofed IP addresses useless. (iii) Flooding is not the only way to carry out an App-DDoS attack. There are many other forms, such as consuming the resources of the server (CPU, memory, hard disk, or database), arranging the malicious traffic to mimic the average request rate of the normal users, or utilizing a large-scale botnet to produce low rate attack flows. This makes detection more difficult.

From the literature, few studies can be found that focus on the detection of App-DDoS attacks. This paper develops a novel model to capture the browsing patterns of web users and to detect the App-DDoS attacks. Our contributions are in three aspects: (i) based on the hidden semi-Markov model (HsMM) [13], [14], a new model is introduced for describing the browsing behavior of web users and detection of the App-DDoS attacks; (ii) a new effective algorithm is proposed for implementation of the forward process of HsMM and on-line detection of the App-DDoS attacks; and (iii) experiments based on real traffic are conducted to validate our detection method.

This paper is organized as follows. In Section II, we describe the related works of our research. In Section III, we present the assumptions made in our scheme and establish a new model for our scheme in Section IV. Then, in Section V, we propose a novel detection algorithm based on HsMM. The whole scheme is validated in Section VI by conducting an experiment using real traffic data. Finally, in Section VII, we discuss some important issues on the scheme and conclude our work in Section VIII.
[...] to specific DDoS attacks based on the Management Information Base (MIB) [5]. Limwiwatkul [6] discovered the DDoS attacking signature by analyzing the TCP packet header against well-defined rules and conditions. Noh [7] attempted to detect attacks by computing the ratio of TCP flags to TCP packets received at a webserver and considering the relative proportion of arriving packets devoted to TCP, UDP, and ICMP traffic. Basu [8] proposed techniques to extract features from connection attempts and classify attempts as suspicious or not. The rate of suspicious attempts over a day helped to expose stealthy DoS attacks, which attempt to maintain effectiveness while avoiding detection.

Little existing work has been done on the detection of App-DDoS attacks. Ranjan et al. [9] used statistical methods to detect characteristics of HTTP sessions and employed rate-limiting as the primary defense mechanism. Besides the statistical methods, some researchers combated the App-DDoS attacks by “puzzle”. Kandula [10] designed a system to protect a web cluster from DDoS attacks by (i) designing a probabilistic authentication mechanism using CAPTCHAs (acronym for “Completely Automated Public Turing test to tell Computers and Humans Apart”) and (ii) designing a framework that optimally divides the time spent in authenticating new clients and serving authenticated clients.

However, statistical methods can hardly distinguish the malicious HTTP requests from the normal ones. Hence, such methods can merely be used for anomaly monitoring but not for filtering. Defenses based on puzzles, which require each client to return the solution before accessing the site, are not effective because (i) CAPTCHAs might annoy users and introduce additional service delays for legitimate clients; (ii) puzzles also have the effect of denying web crawlers access to the site and disabling search engines which index the content. Furthermore, new techniques may render the puzzles solvable using automated methods [16]. In [17], Jung et al. proposed methods to filter the HTTP-level offending DoS traffic. They provided topological clustering heuristics for a web server to distinguish denial of service attacks from flash crowd behavior according to the source IP address. However, IP addresses are subject to spoofing, and using a white-list of source addresses of legitimate clients/peers is difficult, since many hosts may have dynamic IP addresses due to the use of NAT, DHCP, and mobile-IP.

Web user behavior mining has been well studied. The existing works can be summarized as the following four types. The first is based on a probabilistic model, e.g., [18], which uses a Zipf-like distribution to model the page jump-probability, a double Pareto distribution for the link-choice, and a log-normal distribution for the revisiting, etc. The second is based on click-streams and web content, e.g., [19], which uses data mining to capture a web user’s usage patterns from the click-streams dataset and page content. The third is based on the Markov model, e.g., [20], which uses Markov chains to model the URL access patterns that are observed in navigation logs based on the previous state. The fourth applies the user behavior to implement anomaly detection, e.g., [21], which uses system-call data sets generated by programs to detect anomalous access to a UNIX system based on data mining.

The main disadvantage of the first type of method is that it does not take into account the user’s series of operations (e.g., which page will be requested in the next step). Therefore, it cannot describe the browsing behavior of a user, because the next page the user will browse is primarily determined by the current page he/she is browsing. The second type of method needs intensive computation for page content processing and data mining, and hence it is not very suitable for on-line detection. In addition, the click-stream data sets required for the algorithms are collected by additional means (e.g., cookies or Java applets), which require the support of clients. The third type of method omits the dwell time that the user stays on a page while reading, and it does not consider the cases in which a user may not follow the hyperlinks provided by the current page or in which some of a user’s HTTP requests are answered by proxies and never arrive at the server. Besides, when the total number of pages in the website (i.e., the state space of the Markov model) is very large, the algorithm would be too complex to be implemented in online detection. The last type of method merely focuses on system calls of a UNIX program, instead of web user browsing behavior. Obviously, system calls can be recorded completely by the log, but the observations of web users’ requests are incomplete because many HTTP requests are answered by proxies or caches and therefore never arrive at the server or get recorded in the log. Thus, a new method designed for detection of App-DDoS attacks has to consider these issues.

III. MODEL PRELIMINARIES AND ASSUMPTIONS

The proposed scheme for App-DDoS attack detection and filtering is based on two results obtained by web mining [22]: (i) about 10% of the webpages of a website may draw 90% of the accesses; (ii) web user browsing behavior can be abstracted and profiled by users’ request sequences. Thus, we can use a universal model to profile the short-term web browsing behavior, and we only need the logs of the webserver to build the model, without any additional support from outside of the webserver.

Another assumption of the proposed scheme is that it is difficult for an attacker or virus routine to launch an App-DDoS attack by completely mimicking the normal web user access behavior. This assumption is based on the following considerations. Usually, browsing behavior can be described by three elements: HTTP request rate, page viewing time, and requested sequence (i.e., the requested objects and their order). In order to launch stealthy attacks, attackers can mimic the normal access behavior by simulating the HTTP request rate, the viewing time of a webpage, or the HTTP request sequence. The first two simulations can be carried out by HTTP simulation tools (e.g., [23]). The HTTP attack traffic generated this way has statistical characteristics similar to those of normal traffic, but it cannot simulate the process of the user browsing behavior because it is a pre-designed routine and cannot capture the dynamics of the users and the networks. Besides, the low rate, stealthy attack requires more infected computers, which needs the help of a large-scale botnet. Another option for attack is simulating the HTTP request sequence. Four options can be used to carry out this intention: stochastic generation, presetting, interception with replaying, and remote control. Using the first method, attackers (routines) generate the HTTP request objects randomly, [...]
IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 17, NO. 1, FEBRUARY 2009
[Equations (1)–(3)]

The backward variables and the backward formulas:

[Equations (4)–(11)]

[Equation (13)]

where [...], and [...] is the Null State.

Now we yield the estimation formulas based on these variables. Considering the observation variables [...] and [...] in [...] are independent of each other, then we have [...]. Furthermore, because our anomaly detection method will have multiple observation sequences on multiple user behaviors, we derive the re-estimation algorithm of the HsMM for multiple observation sequences by the frequency in this paper. Assuming there are [...] observation sequences with different lengths [...], we obtain the extended re-estimation formulas:

[Equations (14), (15), (18)–(20)]

where [...], for all [...], the variables whose superscript is [...] belong to the [...]th user, and [...] are the normalization factors, such that [...]
XIE AND YU: A LARGE-SCALE HIDDEN SEMI-MARKOV MODEL FOR ANOMALY DETECTION ON USER BROWSING BEHAVIORS
VII. DISCUSSION
Several related issues of the proposed scheme are discussed
in this section.
Fig. 15. Varieties of users’ likelihood. (a) Entropy versus index of request for normal users. (b) Entropy versus time for normal users. (c) Entropy versus index of
request for attack sequences. (d) Entropy versus time for attack sequences.
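
The thresholding idea discussed in this section, fitting a Gaussian to the entropies of normal training sequences and flagging values far from the mean, can be sketched in Python. This is an illustration under stated assumptions, not the authors' procedure: the training values are made up, the helper names `fit_gaussian`, `coverage`, and `is_anomalous` are hypothetical, and the two-sided rule |x − μ| > kσ is only one plausible reading of the detection level.

```python
import math
from statistics import mean, pstdev


def fit_gaussian(entropies):
    """Return (mean, standard deviation) of the training entropies."""
    return mean(entropies), pstdev(entropies)


def coverage(k):
    """Probability mass of a Gaussian within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2.0))


def is_anomalous(entropy, mu, sigma, k=3.0):
    """Assumed rule: flag a sequence whose entropy is more than k*sigma from mu."""
    return abs(entropy - mu) > k * sigma


# Made-up entropies of normal training sequences (illustrative only).
train = [-2.0, -2.2, -1.8, -2.1, -1.9]
mu, sigma = fit_gaussian(train)

print(round(coverage(3.0), 4))        # share of normal values inside mu +/- 3*sigma
print(is_anomalous(-2.3, mu, sigma))  # a value close to the training mean
print(is_anomalous(-4.0, mu, sigma))  # a value far from the training mean
```

With k = 3, `coverage` evaluates to about 0.9973, which is the 99.7% interval quoted in the text. A smaller k would raise the detection rate at the cost of more false positives.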
[...] based on the Central Limit Theorem. That is, given a distribution with a mean and variance, the sampling distribution approaches a Gaussian distribution. Thus, we can describe the entropy distribution of the training data using the following Gaussian distribution:

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

where $\mu$ is the mean and $\sigma^2$ is the variance. Considering that the $3\sigma$ error level gives a confidence interval of 99.7%, it should be good enough even for high precision detection scenarios. Table II lists the detection threshold settings and the corresponding false positive rate (FPR) and detection rate (DR) of our experiments (the mean and variance have been given in Section VI). As indicated in this table, the detection level could be safely selected to be [...], and this choice ensures an FPR smaller than 2.0% and a DR larger than 97%.

VIII. CONCLUSIONS AND FUTURE WORK

In this paper, we focused on the detection of App-DDoS attacks and presented a new model to describe the browsing behavior of web users based on a large-scale Hidden semi-Markov Model. A new on-line algorithm based on the M-algorithm was designed for the anomaly detection. A set of real traffic data collected from an educational website and generated App-DDoS attack traffic were used to validate our model. The experiment results showed that when the detection threshold of entropy is set as [...], the DR is 98% and the FPR is 1.5%, which demonstrated that the model could be used effectively to describe the browsing behavior of normal web users and detect the App-DDoS attacks. We also showed that the new algorithm can reduce the memory requirement and improve the computational efficiency.

Several issues will need further research: 1) coping with the attacks launched by dynamic webpages (e.g., script); 2) detecting [...]

ACKNOWLEDGMENT

The authors would like to thank the anonymous referees for their helpful comments and suggestions to improve this work. They also thank the Network Center of Sun Yat-Sen University for their help with the access logs.

REFERENCES

[1] C. Douligeris and A. Mitrokotsa, “DDoS attacks and defense mechanisms: Classification and state-of-the-art,” Computer Networks: The Int. J. Computer and Telecommunications Networking, vol. 44, no. 5, pp. 643–666, Apr. 2004.
[2] J. Mirkovic, G. Prier, and P. L. Reiher, “Attacking DDoS at the source,” in Proc. 10th IEEE Int. Conf. Network Protocols, Sep. 2002, pp. 312–321.
[3] C. Jin, H. Wang, and K. G. Shin, “Hop-count filtering: An effective defense against spoofed traffic,” in Proc. ACM Conf. Computer and Communications Security, 2003, pp. 30–41.
[4] T. Peng, K. Ramamohanarao, and C. Leckie, “Protection from distributed denial of service attacks using history-based IP filtering,” in Proc. IEEE Int. Conf. Communications, May 2003, vol. 1, pp. 482–486.
[5] J. B. D. Cabrera et al., “Proactive detection of distributed denial of service attacks using MIB traffic variables: a feasibility study,” in Proc. IEEE/IFIP Int. Symp. Integrated Network Management, May 2001, pp. 609–622.
[6] L. Limwiwatkul and A. Rungsawangr, “Distributed denial of service detection using TCP/IP header and traffic measurement analysis,” in Int. Symp. Communications and Information Technologies 2004 (ISCIT 2004), Sapporo, Japan, Oct. 29, 2004.
[7] S. Noh, C. Lee, K. Choi, and G. Jung, “Detecting distributed denial of service (DDoS) attacks through inductive learning,” Lecture Notes in Computer Science, vol. 2690, pp. 286–295, 2003.
[8] R. Basu, K. R. Cunningham, S. E. Webster, and P. R. Lippmann, “Detecting low-profile probes and novel denial of service attacks,” in Proc. 2001 IEEE Workshop on Information Assurance and Security, Jun. 2001, pp. 5–10.
[9] S. Ranjan, R. Swaminathan, M. Uysal, and E. Knightly, “DDoS-resilient scheduling to counter application layer attacks under imperfect detection,” in Proc. IEEE INFOCOM, Apr. 2006 [Online]. Available: http://www-ece.rice.edu/~networks/papers/dos-sched.pdf
[10] S. Kandula, D. Katabi, M. Jacob, and A. W. Berger, “Botz-4-Sale: Surviving organized DDoS attacks that mimic flash crowds,” Mass. Inst. Technol., Tech. Report TR-969, 2004 [Online]. Available: http://www.usenix.org/events/nsdi05/tech/kandula/kandula.pdf
[11] R. K. C. Chang, “Defending against flooding-based distributed denial-of-service attacks: A tutorial,” IEEE Commun. Mag., pp. 43–51, Oct. 2002.
[12] MyDoom virus. [Online]. Available: http://www.us-cert.gov/cas/techalerts/TA04-028A.html
[13] S.-Z. Yu and H. Kobayashi, “An efficient forward-backward algorithm for an explicit duration hidden Markov model,” IEEE Signal Process. Lett., vol. 10, no. 1, pp. 11–14, Jan. 2003.
[14] S. Z. Yu, Z. Liu, M. Squillante, C. Xia, and L. Zhang, “A hidden semi-Markov model for web workload self-similarity,” in Proc. 21st IEEE Int. Performance, Computing, and Communications Conf. (IPCCC 2002), Phoenix, AZ, Apr. 2002, pp. 65–72.
[15] J. B. Anderson and S. Mohan, “Sequential coding algorithms: A survey and cost analysis,” IEEE Trans. Commun., vol. COM-32, pp. 169–176, Feb. 1984.
[16] G. Mori and J. Malik, “Recognizing objects in adversarial clutter: Breaking a visual captcha,” in Proc. IEEE Computer Society Conf. Computer Vision and Pattern Recognition, Jun. 2003, vol. 1, pp. 134–141.
[17] J. Jung, B. Krishnamurthy, and M. Rabinovich, “Flash crowds and denial of service attacks: Characterization and implications for CDNs and web sites,” in Proc. 11th IEEE Int. World Wide Web Conf., May 2002, pp. 252–262.
[18] S. Bürklen et al., “User centric walk: An integrated approach for modeling the browsing behavior of users on the web,” in Proc. 38th Annu. Simulation Symp. (ANSS’05), Apr. 2005, pp. 149–159.
[19] J. Velásquez, H. Yasuda, and T. Aoki, “Combining the web content and usage mining to understand the visitor behavior in a web site,” in Proc. 3rd IEEE Int. Conf. Data Mining (ICDM’03), Nov. 2003, pp. 669–672.
[20] D. Dhyani, S. S. Bhowmick, and W.-K. Ng, “Modelling and predicting web page accesses using Markov processes,” in Proc. 14th Int. Workshop on the Database and Expert Systems Applications (DEXA’03), 2003, pp. 332–336.
[21] X. D. Hoang, J. Hu, and P. Bertok, “A multi-layer model for anomaly intrusion detection using program sequences of system calls,” in Proc. 11th IEEE Int. Conf. Networks, Oct. 2003, pp. 531–536.
[22] M. Kantardzic, Data Mining Concepts, Models, Methods And Algorithm. New York: IEEE Press, 2002.
[23] J. Cao, W. S. Cleveland, Y. Gao, K. Jeffay, F. D. Smith, and M. Weigle, “Stochastic models for generating synthetic HTTP source traffic,” in Proc. IEEE INFOCOM 2004, vol. 3, pp. 1546–1557.
[24] A. Sarika, A. Saumya, and G. Bryon, “DDoS attack simulation, monitoring, and analysis,” CS 590D: Security Topics in Networking and Distributed Systems Final Project Report, Apr. 29, 2004, Purdue University, West Lafayette, IN. [Online]. Available: http://www.cs.purdue.edu/homes/bgloden/DDoS_Attack_Simulation.pdf
[25] K. Jiejun et al., “Random flow network modeling and simulations for DDoS attack mitigation,” in Proc. IEEE Int. Conf. Communications (ICC ’03), May 2003, vol. 1, pp. 487–491.
[26] X. Yi and Y. Shunzheng, “A dynamic anomaly detection model for web user behavior based on HsMM,” in Proc. 10th Int. Conf. Computer Supported Cooperative Work in Design (CSCWD 2006), Nanjing, China, May 2006, vol. 2, pp. 811–816.

Yi Xie received the B.S. and M.S. degrees from Sun Yat-Sen University, Guangzhou, China. From July 1996 to July 2004, he was working with the Civil Aviation Administration of China (CAAC). He returned to academia in 2004. He is currently working toward the Ph.D. degree at Sun Yat-Sen University. He was a visiting scholar at George Mason University during 2007 to 2008. His research interests are in network security and web user behavior model algorithms.

Shun-Zheng Yu (M’08) received the B.Sc. degree from Peking University, Beijing, China, and the M.Sc. and Ph.D. degrees from the Beijing University of Posts and Telecommunications, China. He was a visiting scholar at Princeton University and IBM Thomas J. Watson Research Center during 1999 to 2002. He is currently a Professor at the School of Information Science and Technology, Sun Yat-Sen University, Guangzhou, China. His recent research interests include networking, traffic modeling, anomaly detection and algorithms for hidden semi-Markov models.