
2011 Frontiers of Information Technology

DOFUR: DDoS Forensics Using MapReduce

Rana Khattak, Shehar Bano, Shujaat Hussain, Zahid Anwar


SEECS, NUST,
Islamabad, Pakistan.
{10msccsmkhattak; 10msccsabano; shujaat.hussain; zahid.anwar} @seecs.edu.pk

Abstract—Currently we have seen a very sharp increase in network traffic. Due to this increase, the size of attack log files has also increased greatly, and using conventional techniques to mine the logs and derive meaningful analyses about the DDoS attacker's location and possible victims has become increasingly difficult. We propose a technique using Hadoop's MapReduce to deduce results efficiently and quickly, which would otherwise take a long time if conventional means were used. The aim of this paper is to describe how we designed a framework to detect the packets in a dataset which belong to a DDoS attack using the MapReduce model provided by Hadoop. Experimental results using a real dataset show that parallelising DDoS detection can greatly improve efficiency.

Keywords—DDoS, MapReduce

I. INTRODUCTION

A Denial of Service (DoS) attack is launched to make an internet resource unavailable, often by overwhelming the victim with a large number of requests. DoS attacks can be categorized on the basis of single source versus multiple sources. Multi-source attacks are called distributed DoS, or DDoS, attacks.

Today computers and networks are attacked very frequently, and these attacks can be very expensive and hard to recover from. This leads to the need for computer forensics, which can help us answer the following: Are we under attack? Who is attacking us? And which incoming traffic is malicious or part of the attack? Evidence of such intrusions is required in case the affected party wants to go to court and legal action is to be taken against the adversary.

These forensic investigations are very hard and in many cases are done manually [1]. Information about the network, the web, and different protocols and applications is saved in log files. These log files usually save everything and anything indiscriminately. An intelligent attacker can mask his attacks by mixing them up with legitimate requests.

The type of attack that we deal with is the DDoS attack. Distributed DoS attacks are a major security concern these days. Like DoS, DDoS attacks are launched to compromise the availability of a system or a network, but unlike DoS, the attack is launched by the adversary by creating zombies that send requests to the victim and overwhelm it with a large volume of traffic. This creates a bottleneck, and the victim can no longer entertain requests from legitimate users, denying service to them.

During DDoS attacks, the log files swell to huge sizes. These log files, if analysed properly and effectively, can help detect and recover from a DDoS attack. They can take a long time to process through conventional means, delaying the results and therefore the recovery phase. As attackers keep devising new ways to commit crimes and intrude, our emphasis is that the forensic community also needs to get smarter: instead of relying on traditional techniques for detecting advanced attacks, it should adopt advanced techniques that are efficient and help detect intrusions in real time.

MapReduce is a model provided by Hadoop that is used for parallel processing of distributed data [1]. A MapReduce deployment often consists of several distributed computing machines working together in the form of a cluster. There is a cluster head, or master machine, and several mapper and reducer machines. Tasks are assigned and managed by the master. MapReduce is a two-step process. The map phase is the first processing step, in which the input in the Hadoop Distributed File System (HDFS) is broken into several splits and each split is assigned to a mapper. These splits are then processed in parallel by the mappers. In the reduce phase, the intermediate results produced by the map phase are summarized, and all records associated with a given key are processed by a single reducer.

Since in DDoS attacks large data sets have to be analyzed quickly to take decisions about suspicious activities, MapReduce can prove to be a very good solution. We tested our forensics system with the data set provided by Lincoln Labs [6] and showed that DDoS forensics on Hadoop MapReduce can produce results much more efficiently than a serial analyser. We propose distributed forensics for distributed attacks.

Section 2 presents work of others related to ours. Section 3 describes the design and architecture of our work. Section 4 gives the details of the implementation of our work. Section 5 evaluates our strategy. Section 6 concludes the report and outlines areas of future work.
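To make the two-phase flow described above concrete, the sketch below simulates the map and reduce steps in plain Java over an in-memory list of packet records. It deliberately uses no Hadoop API: the record format, the ICMP-ping test, and the `phase1` key are simplified assumptions for illustration, not the authors' code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal in-memory sketch of the map/reduce flow described above:
// "mappers" emit (attack-phase, 1) pairs, the "reducer" sums counts per phase.
public class MapReduceSketch {

    // Map step: emit ("phase1", 1) for every packet that looks like an ICMP echo request.
    static List<Map.Entry<String, Integer>> map(List<String> packets) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String pkt : packets) {
            if (pkt.contains("ICMP") && pkt.contains("(ping)")) {
                pairs.add(Map.entry("phase1", 1));
            }
        }
        return pairs;
    }

    // Reduce step: sum the counts for each key (attack phase).
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new HashMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            sums.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<String> packets = List.of(
            "1 0.1 10.0.0.1 172.16.115.20 ICMP Echo (ping) request",
            "2 0.2 10.0.0.1 172.16.115.21 ICMP Echo (ping) request",
            "3 0.3 10.0.0.9 172.16.115.20 TCP 49212 > telnet [SYN]");
        System.out.println(reduce(map(packets))); // {phase1=2}
    }
}
```

In real Hadoop the pairs would be partitioned and sorted by key between the two steps, so each reducer sees all values for one key; here a single map stands in for that shuffle.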

Figure 1: DDoS attack

978-0-7695-4625-4/11 $26.00 © 2011 IEEE    DOI 10.1109/FIT.2011.29
II. RELATED WORK

Lincoln laboratories carried out similar forensics [6]. They created an example attack scenario, implemented it, carried out a DDoS attack, and captured the attack traffic at Eyrie Air Force Base, the site where zombies were created by the adversary. Their attack scenario consisted of five phases.

The attacker sends ICMP echo requests to all the subnets of the Air Force Base to determine which hosts are alive. The hosts that were found up in the previous phase are then probed to determine which hosts are running the Sadmind program. This helps the attacker know which hosts might be vulnerable to the buffer overflow vulnerability present in the Sadmind remote administration tool. This is done by asking the hosts at the Air Force Base for the port number of the Sadmind service and then trying to connect to the port number supplied in response. Successful connections to hosts are shortlisted.

The hosts that were shortlisted in phase 2 are now exploited through the buffer overflow vulnerability present in them. Several attempts are made to create a new user named "hacker2" on the remote machines. The attacker then installs the mstream tool on the hosts where "hacker2" was successfully created. Each client informs the master when it is alive, and the master keeps a record of live clients.

The attack is launched by overwhelming the target victim with a huge number of connection requests from all the mstream clients using spoofed source addresses.

Processing and Analyzing Large Scale Data is a project in which intrusion detection has been carried out through Hadoop MapReduce using the TRW algorithm. The algorithm only identifies whether a host IP address is malicious or not. It cannot detect spoofed IPs and does not identify the individual packets that are part of the attack.

[5] used fuzzy logic and expert systems to detect network intrusions. Their method is efficient, automatic, and reliable. Although their results show quite an increase in the correct classification rate of logs compared to traditional means of evidence gathering, it is still not 100% accurate and there is room for improvement.

A traceback technique has also been proposed in [7] for digital forensics. Both of these techniques are to be appreciated for improving the methodology of network forensics, but they still lag behind the adversary's speed and intelligence.

In our project, we have used the same datasets and attack-scenario assumptions as provided by the Lincoln laboratories, and correctly identified the packets that belong to each phase of the attack.

III. DESIGN AND ARCHITECTURE

The MapReduce cluster was provided with a Distributed File System (DFS) containing a large set of attack data from the experiments carried out by the Lincoln Lab in 2000 for DARPA. The master distributes different sets of data among different mappers, and the intermediate results are stored at the mappers. The master then assigns the task of extracting information regarding the attack data to the reducers, and the final results are stored to the distributed file system.

Figure 2: Analyzing data collected at firewall through MapReduce Hadoop

The data files contain the information of all the packets that have been sent from or received by a network. Each packet's record includes the timestamp, packet length, source IP, destination IP, source port, destination port, flag (whether the packet is a SYN/ACK/FIN/RST), and the protocol (whether the connection is TCP or UDP).

All DDoS attacks follow a similar lifecycle with minor variations:
1- IP sweep
2- Port scan
3- Vulnerability exploit
4- DoS attack

We maintain a data structure in which each row represents a host on our own network. The number of objects in each row is variable. Each row object represents a connection (IP address + port) to an outside host and other information about it (e.g. connection state, outbound or inbound direction).

No. Time      Source          Destination     Protocol  Info
1   0.000000  202.77.162.213  172.16.115.20   TCP       49212 > telnet [SYN] Seq=0 Win=32768 Len=0 MSS=1460 WS=0 TSV=190935 TSER=0
Frame 1: 74 bytes on wire (592 bits), 74 bytes captured (592 bits)
Ethernet II, Src: 3com_9c:b2:8e (00:10:5a:9c:b2:8e), Dst: Oracle_89:ba:28 (08:00:20:89:ba:28)
Internet Protocol, Src: 202.77.162.213 (202.77.162.213), Dst: 172.16.115.20 (172.16.115.20)
Transmission Control Protocol, Src Port: 49212 (49212), Dst Port: telnet (23), Seq: 0, Len: 0

Figure 2: Pcap file converted to a text file
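The text-formatted packet records shown in Figure 2 can be split back into the fields the paper lists (No., time, source, destination, protocol, info). The sketch below is a hypothetical parser for that whitespace-separated summary-line layout; the class and field names are illustrative assumptions, not the authors' code.

```java
// Hypothetical parser for one text-formatted packet summary line of the form
// "No. Time Source Destination Protocol Info...", e.g. the record in Figure 2.
public class PacketLine {
    final int no;
    final double time;
    final String source, destination, protocol, info;

    PacketLine(String line) {
        // Split on runs of whitespace; the first five tokens are fixed fields,
        // everything after the protocol stays together as free-form info.
        String[] t = line.trim().split("\\s+", 6);
        no = Integer.parseInt(t[0]);
        time = Double.parseDouble(t[1]);
        source = t[2];
        destination = t[3];
        protocol = t[4];
        info = t.length > 5 ? t[5] : "";
    }

    public static void main(String[] args) {
        PacketLine p = new PacketLine(
            "1 0.000000 202.77.162.213 172.16.115.20 TCP 49212 > telnet [SYN] Seq=0");
        System.out.println(p.source + " -> " + p.destination + " (" + p.protocol + ")");
        // prints: 202.77.162.213 -> 172.16.115.20 (TCP)
    }
}
```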

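The per-host connections table described in the design section (one row per internal host, a variable number of row objects, each recording an outside endpoint and connection state) can be sketched as a map from internal host to a list of connection records. The class and field names here are illustrative assumptions, as is the distinct-source count used to stand in for a threshold check.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the connections table: each internal host maps to a
// variable-length list of row objects, one per (outside IP + port) connection.
public class ConnectionsTable {
    // One row object: an outside endpoint plus bookkeeping about the connection.
    static class Conn {
        final String outsideIp;
        final int port;
        final boolean outbound;   // direction of the connection
        String state;             // e.g. "SYN", "ESTABLISHED"
        Conn(String outsideIp, int port, boolean outbound, String state) {
            this.outsideIp = outsideIp; this.port = port;
            this.outbound = outbound; this.state = state;
        }
    }

    final Map<String, List<Conn>> rows = new HashMap<>();

    void record(String internalHost, Conn c) {
        rows.computeIfAbsent(internalHost, k -> new ArrayList<>()).add(c);
    }

    // A simple per-host query in the spirit of the 'horizontal threshold':
    // how many distinct outside IPs have touched this internal host?
    long distinctOutsideIps(String internalHost) {
        return rows.getOrDefault(internalHost, List.of()).stream()
                   .map(c -> c.outsideIp).distinct().count();
    }

    public static void main(String[] args) {
        ConnectionsTable t = new ConnectionsTable();
        t.record("172.16.115.20", new Conn("202.77.162.213", 23, false, "SYN"));
        t.record("172.16.115.20", new Conn("202.77.162.214", 23, false, "SYN"));
        System.out.println(t.distinctOutsideIps("172.16.115.20")); // prints 2
    }
}
```

Comparing such a count against a limit within a sliding time window would give the horizontal-threshold test the paper applies to ping requests.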
Figure 3: Data Flow and Architecture diagram

The dataset provided is in binary format, and it was converted into text format as shown in figure 2. After conversion to text format, the file was given to HDFS for the Hadoop cluster as shown in figure 3.

We populate the connections table by analysing the dump packets one by one. If a particular IP address crosses the 'horizontal threshold' or 'vertical threshold' for acceptable ping requests, we label it as suspicious and pass it on as input to the next phase to further observe its behaviour. The horizontal threshold is a limit on the number of IP addresses allowed to ping a single host on our network within a fixed time window. The vertical threshold is a limit on the number of ping requests an external host sends to all of our internal hosts within a time window.

If an external host crosses the 'horizontal port threshold' or the 'vertical port threshold', it can qualify as input to phase 3. The horizontal port threshold is a limit on the number of ports an external host tries to connect to on any of our internal hosts within a time window. The vertical port threshold is a limit on the number of internal hosts to whose ports a single external host may connect within a time window.

In phase 3, if the IP addresses that came down from phase 1 and phase 2 are found to be involved in a lot of incoming traffic to a single internal host, and such packets are malformed, it might be a sign of a buffer overflow attack. Such IP addresses are marked.

In phase 4, if the outbound connections have spoofed IP addresses (i.e. none of the source IP addresses belong to our network) and the destination is the same for all such packets, it might be the case that our network hosts have been used as zombies to launch a DoS attack against the victim. We may deduce that the IP addresses shortlisted in phase 4 are the main culprits behind this attack.

IV. IMPLEMENTATION

A. Implementation strategy:
We used Java as the programming language to implement the traffic analyzing technique. Karmasphere is a graphical environment for creating Hadoop jobs. Rather than dealing with low-level Hadoop management, Karmasphere provides cluster monitoring, job monitoring, and file system management through a high-level user environment. This not only saves time learning Hadoop, but also increases productivity when using Hadoop.

As is the case with most of the programs in Hadoop, we also used string matching for malicious packets. Mappers identified the packets which satisfied the conditions defined in the five phases and wrote them to the file. The reduce phase was used to accumulate the map phase outputs.

We analyzed the datasets provided by the Lincoln laboratories, exported them to a text file, and programmed Hadoop to identify the packets in the dataset that belong to each attack phase. Each mapper takes a packet as input and emits an (attack phase, packet information) key-value pair. Some part of our code for the mapper is:

1) for (int i = 0; i < 3; i++) {
2)   word.set(tokenizer.nextToken()); }
3) for (int j = 0; j < IPs.length; j++)
4)   if (pkt.indexOf("ICMP") > 0
5)       && pkt.indexOf("(ping)") > 0)
6)   {
7)     pkt = pkt.substring(0, count);
8)     word.set("\n\n**** Phase 1: IP Sweep from a remote site ****\n\nNo. Time Source Dest Protocol Info\n" + pkt);
9)     keyval.set(1);
10)    context.write(word, keyval);
11)  }

This is the code for identifying packets that belong to phase 1. We start our work by suspecting a particular IP address as the attacker and then prove it to be the culprit by showing its participation in the attack. In line 4, the mapper searches for all the packets that are part of ICMP echo requests. We see that a particular IP address produces a huge number of such requests. This suspicion is then strengthened by proving its participation in each phase of the attack. As we can see, the mapper then emits a pair of the phase and a count of one for each occurrence of a suspicious packet.

Each reducer gets the records sorted according to the key, i.e. the attack phase, and sums the counts for each phase; if the sum is greater than the threshold value, the reducer emits the (attack phase, no. of packets) pair. As an optimization, the reducer is also used as a combiner on the map outputs.

Some part of our code for the reducer is shown below, which checks whether the number of ICMP packets is greater than a particular threshold:

int sum = 0;
while (values.hasNext()) {
  sum += values.next().get(); }
if (sum >= threshold) {
  context.write(key, sum);
}

B. Challenges in Implementation:
The datasets provided by the Lincoln laboratories were in the form of Pcap files. We needed the input to be in text form, so we converted the Pcap files to .txt. MapReduce splits the file among mappers, but the problem we faced initially was that the input reader splits the file by size. We didn't want

any packet to be spread across several mappers. To overcome this, we created our own custom reader which treats each packet as a single whole object, taking file splitting into consideration. Our code for the custom reader is:

1) long start = split.getStart();
2) end = split.getStart() + split.getLength();
3) fsin.seek(start);
4) if (start != 0) {
5)   readUntilMatch(endTag, false);
6) }
7) public boolean nextKeyValue() throws IOException {
8)   if (!stillInChunk) return false;
9)   boolean status = readUntilMatch(endTag, true);
10)  value = new Text();
11)  value.set(buffer.getData(), 0, buffer.getLength());
12)  key = new LongWritable(fsin.getPos());
13)  buffer.reset(); }

In line 1, the variable start is assigned the start of our split; in our case, the beginning of each packet record is marked by the "No." token. The end of the split is calculated by adding the length of the split to its start (line 2). Line 5 calls readUntilMatch, a function that also checks for the end of the file:

1) private boolean readUntilMatch(byte[] match, boolean withinBlock) throws IOException {
2)   int i = 0;
3)   while (true) {
4)     int b = fsin.read();
5)     if (b == -1) return false;
6)     if (withinBlock) buffer.write(b);
7)     if (b == match[i]) {
8)       i++;
9)       if (i >= match.length) {
10)        return fsin.getPos() < end;
11)      }
12)    } else i = 0; } }

V. EVALUATION AND RESULTS

We compared the performance of our program, written in Java and distributed through Hadoop, against a sequential Java program given the same input file. The tests were carried out on both programs with different data sizes. It was found that the sequential program performed better than Hadoop up to a certain size, because Hadoop has the additional overhead of distributing tasks, which becomes an issue for small data sizes. But as the input sizes grew, Hadoop's relative performance improved dramatically. This was because of the fact that in Hadoop the nodes could work in parallel and produce results in much less time.

There were two more large data sets we experimented on. The size of one dataset was 911 MB; the results for this dataset showed 3 malicious hosts. The 6.9 GB dataset did not contain any malicious hosts. There was a significant improvement in execution time.

Both time and the number of processors involved are factors that can greatly speed up the execution time, and hence the performance, of the Hadoop MapReduce system.

Figure 4: Comparison

VI. FUTURE WORK AND CONCLUSION

As the current work converts the binary files to text files and does the processing on them, the next step would be implementing a binary input module which will take binary files as input. We will also further generalize the solution to detect the attacker and victim IPs from a data set based on entropy calculations, and add a module to detect packets with spoofed IPs.

REFERENCES

[1] Hadoop Wiki, http://wiki.apache.org/hadoop/PoweredBy, date accessed: 5 June 2011.
[2] J. Lin and C. Dyer, Data-Intensive Text Processing with MapReduce, Morgan & Claypool Publishers, 2010.
[3] T. White, Hadoop: The Definitive Guide, O'Reilly Media, Inc., June 2009.
[4] J. Dean and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI, 2004.
[5] Niandong Liao, Shengfeng Tian, Tinghua Wang, Network Forensics Based on Fuzzy Logic and Expert System.
[6] Lincoln Laboratory, Massachusetts Institute of Technology, accessed on: 1 June 2011, http://www.ll.mit.edu/mission/communications/ist/corpora/ideval/data/2000/LLS_DDOS_1.0.html.
[7] H. Achi, A. Hellany and M. Nagrial, Network Security Approach for Digital Forensics Analysis.
[8] Apache Hadoop, http://hadoop.apache.org/, accessed: 2 June 2011.

