Dr.R.Rajaram2
G.Athiappan3
Thiagarajar College of Engineering, Madurai-15
M.Muthupandian4
Thiagarajar College of Engineering, Madurai-15
ABSTRACT
Email has become one of the fastest and most economical forms of communication. This paper applies classification-based data mining to the task of suspicious email detection, based on deception theory. Email data was classified using four different classifiers (Neural Network, SVM, Naive Bayesian and Decision Tree). The experiments were performed in WEKA, using different features by which the email corpus is classified into suspicious or normal emails. Experimental results show that a simple ID3 classifier, which builds a binary tree, gives promising detection rates.
KEYWORDS
Data Mining, Decision Tree, Neural Network, Naive Bayes, SVM, WEKA.
1. INTRODUCTION
E-mail has become one of today's standard means of communication. Email data is also growing rapidly, creating a need for automated analysis. To detect crime, a spectrum of techniques should be applied to discover and identify patterns and make predictions. Data mining has emerged to address the problem of understanding ever-growing volumes of structured data by finding patterns that can be turned into useful knowledge. As individuals increase their use of electronic communication, research has turned to detecting deception in these new forms of communication. Models of deception assume that deception leaves a footprint. Work by various researchers suggests that deceptive writing is characterized by a reduced frequency of first-person pronouns and exclusive words and an elevated frequency of negative emotion words and action verbs [1]. We apply this model of deception, together with a novel set of rich features, to an email dataset. We preprocess the email body and, to train the system, use different data mining classification algorithms [5] that categorize each email as suspicious or normal.
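The deception cues described above can be turned into numeric features by counting how often words from each cue category occur in an email body. The following is a minimal sketch only: the word lists in `CUES` and the function name `cue_frequencies` are hypothetical placeholders chosen for illustration, not the lexicons or preprocessing used in this work.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny cue lexicons (illustration only).
CUES = {
    "first_person": {"i", "me", "my", "mine", "myself"},
    "exclusive": {"but", "except", "without", "unless"},
    "negative_emotion": {"hate", "afraid", "angry", "worthless"},
    "action_verbs": {"go", "carry", "run", "move", "take"},
}


def cue_frequencies(text):
    """Return the relative frequency of each cue category in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words) or 1  # avoid division by zero on empty input
    counts = Counter(words)
    return {
        name: sum(counts[word] for word in lexicon) / total
        for name, lexicon in CUES.items()
    }


if __name__ == "__main__":
    print(cue_frequencies("I hate this but I will go"))
```

Under the deception model, a low `first_person` and `exclusive` frequency combined with a high `negative_emotion` and `action_verbs` frequency would point toward deceptive writing; the thresholds themselves would have to be learned by a classifier.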
1.1 Motivation
Concern about national security has increased significantly since the terrorist attacks of 11 September 2001. The CIA, FBI and other federal agencies are actively collecting domestic and foreign intelligence to prevent future attacks. These efforts in turn motivated us to collect data and take up this work as a challenge.
In this paper, we identify the best classifier, through an experimental study, for the task of classifying emails as suspicious or normal using WEKA. The study proceeds in three steps: email preprocessing, building the classifier, and validation.
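| A | B | C | D | E | F | G | H | I | J | K | L | M | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| yes | Friend | yes | forward | yes | yes | Text | yes | no | no | no | no | yes | Suspicious |
| no | Office | no | forward | no | yes | Pdf | yes | no | no | no | no | yes | Normal |
| yes | Relative | no | forward | no | yes | Picture | no | no | yes | yes | no | yes | Suspicious |
| no | Others | yes | created | yes | no | Picture | no | yes | no | yes | yes | no | Normal |
| no | Friend | yes | created | no | yes | Text | no | no | yes | no | no | yes | Suspicious |
| yes | Relative | no | created | no | no | Picture | yes | yes | yes | yes | yes | no | Suspicious |
| yes | Relative | yes | forward | yes | no | Text | no | yes | no | yes | yes | no | Suspicious |
| no | Friend | yes | forward | no | no | Picture | yes | no | yes | no | yes | yes | Normal |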
A -> Keywords (bomb, blast, attack, hijack, etc.), B -> Sender (relatives, office, friends, others), C -> Subject, D -> Forward/Created (forward, created), E -> file_att, F -> Virus_scan, G -> format, H -> video, I -> Audio, J -> *.exe, K -> periodic, L -> junk, M -> bulk, Class -> Suspicious/Normal.
A. NAIVE BAYES
The Bayes classification technique analyzes the relationship between each independent attribute and the dependent attribute to derive a conditional probability for each relationship. A prediction for a new case is made by combining the effects of the independent variables on the dependent variable. We used 50% of the available email dataset (5000 emails) to train the naive Bayes classifier and the remaining 50% (5000 emails) for validation, that is, to test the performance of the Bayesian classifier. In the validation set, 924 of the 5000 emails were incorrectly classified. The accuracy of suspicious email detection with the Bayesian classifier is around 76.9%.
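The naive Bayes description above can be illustrated on the eight sample rows with attributes A to M. This is a rough sketch, not the WEKA implementation used in the experiments: the function names `train` and `predict` are hypothetical, and the add-one (Laplace) smoothing controlled by `alpha` is an assumption introduced here so that unseen attribute values do not produce zero probabilities.

```python
from collections import Counter

# The eight sample rows, written column by column (attributes A..M).
COLUMNS = {
    "A": "yes no yes no no yes yes no".split(),
    "B": "Friend Office Relative Others Friend Relative Relative Friend".split(),
    "C": "yes no no yes yes no yes yes".split(),
    "D": "forward forward forward created created created forward forward".split(),
    "E": "yes no no yes no no yes no".split(),
    "F": "yes yes yes no yes no no no".split(),
    "G": "Text Pdf Picture Picture Text Picture Text Picture".split(),
    "H": "yes yes no no no yes no yes".split(),
    "I": "no no no yes no yes yes no".split(),
    "J": "no no yes no yes yes no yes".split(),
    "K": "no no yes yes no yes yes no".split(),
    "L": "no no no yes no yes yes yes".split(),
    "M": "yes yes yes no yes no no yes".split(),
}
LABELS = "Suspicious Normal Suspicious Normal Suspicious Suspicious Suspicious Normal".split()


def train(columns, labels, alpha=1.0):
    """Count class frequencies and (attribute, value, class) co-occurrences."""
    class_counts = Counter(labels)
    cond_counts = Counter()
    for attr, values in columns.items():
        for value, cls in zip(values, labels):
            cond_counts[(attr, value, cls)] += 1
    n_values = {attr: len(set(values)) for attr, values in columns.items()}
    return class_counts, cond_counts, n_values, alpha


def predict(model, email):
    """Return normalized posterior probabilities for each class."""
    class_counts, cond_counts, n_values, alpha = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n_cls in class_counts.items():
        score = n_cls / total  # prior P(class)
        for attr, value in email.items():
            # Smoothed estimate of P(attribute = value | class); alpha is an assumption.
            score *= (cond_counts[(attr, value, cls)] + alpha) / (
                n_cls + alpha * n_values[attr]
            )
        scores[cls] = score
    norm = sum(scores.values())
    return {cls: score / norm for cls, score in scores.items()}


if __name__ == "__main__":
    model = train(COLUMNS, LABELS)
    first_email = {attr: values[0] for attr, values in COLUMNS.items()}
    print(predict(model, first_email))
```

Each conditional probability is estimated independently per attribute, and the prediction multiplies them with the class prior, which is the "combining the effects of the independent variables" step described above.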
B. NEURAL NETWORK
The classification procedure using the NN has three steps: data preprocessing, training, and testing. Feature selection picks the set of features that is most informative for the task while removing irrelevant or redundant features. For training, the features selected in the preprocessing step were fed into the NN, and an email classifier was generated through the NN. The learning rate and momentum were set to 0.3 and 0.2 respectively. Out of 5000 emails in the validation set, 63 were misclassified, giving an accuracy of 98.74%.
C. SVM
SVMs are a relatively new learning method, strongly influenced by advances in statistical learning theory. The classifier separates two classes using training examples, with the overall aim of generalizing well to test data. This is achieved by introducing a separating hyperplane that maximizes the margin between the two classes, known as the optimum separating hyperplane. Out of 5000 test samples, the SVM correctly classified 4495 instances, an accuracy of 89.9%.
D. DECISION TREE
The J48 classifier is a simple C4.5 decision tree for classification. It creates a binary tree. The accuracy of detecting suspicious emails using J48 is 96.04%.
The ID3 algorithm is based on constructing a tree that models the classification process. Once the complete tree is constructed, it is applied to each message to classify it. The accuracy of the ID3 classifier is 99.4%.
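Tree construction in ID3 is usually driven by information gain: at each node, the attribute that most reduces the entropy of the class label is chosen as the split. The sketch below computes that criterion for the root node on the eight sample rows with attributes A to M. It illustrates the standard ID3 split rule under that assumption and is not the WEKA code used in the experiments; the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# The eight sample rows, written column by column (attributes A..M).
COLUMNS = {
    "A": "yes no yes no no yes yes no".split(),
    "B": "Friend Office Relative Others Friend Relative Relative Friend".split(),
    "C": "yes no no yes yes no yes yes".split(),
    "D": "forward forward forward created created created forward forward".split(),
    "E": "yes no no yes no no yes no".split(),
    "F": "yes yes yes no yes no no no".split(),
    "G": "Text Pdf Picture Picture Text Picture Text Picture".split(),
    "H": "yes yes no no no yes no yes".split(),
    "I": "no no no yes no yes yes no".split(),
    "J": "no no yes no yes yes no yes".split(),
    "K": "no no yes yes no yes yes no".split(),
    "L": "no no no yes no yes yes yes".split(),
    "M": "yes yes yes no yes no no yes".split(),
}
LABELS = "Suspicious Normal Suspicious Normal Suspicious Suspicious Suspicious Normal".split()


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(labels).values()
    )


def information_gain(values, labels):
    """Entropy reduction obtained by splitting the labels on one attribute."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[value].append(label)
    remainder = sum(
        len(group) / len(labels) * entropy(group) for group in groups.values()
    )
    return entropy(labels) - remainder


def best_attribute(columns, labels):
    """Pick the attribute with the highest information gain."""
    gains = {attr: information_gain(vals, labels) for attr, vals in columns.items()}
    return max(gains, key=gains.get), gains


if __name__ == "__main__":
    root, gains = best_attribute(COLUMNS, LABELS)
    for attr, gain in sorted(gains.items(), key=lambda item: -item[1]):
        print(f"{attr}: {gain:.4f}")
    print("root split:", root)
```

On these eight rows the sender attribute (B) scores slightly higher than the keyword attribute (A), which reflects the known tendency of information gain to favor attributes with many distinct values. A full ID3 implementation would apply the same selection recursively to each subset until the leaves are pure.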
4. EXPERIMENTAL RESULTS
The application of data mining to the task of suspicious email detection is done using data mining classifiers. Experiments were carried out on a small email corpus. For the experimental setting, different sets of 10,000 emails were used, each split into 5000 training emails and 5000 test emails. The training dataset was given as input to WEKA, and the classification techniques decision tree (ID3 and J48), Naive Bayes, Neural Network and SVM were applied. The experimental results show that a simple ID3 algorithm (decision tree classifier) gives better classification accuracy for suspicious email detection. To evaluate the classifiers on the testing dataset, we defined an accuracy measure as follows: Accuracy (%) = correctly_classified_emails / total_emails * 100. An experiment measuring performance against dataset size was conducted using datasets of different sizes. For example, for the dataset of 5000 emails, the accuracy was 99.4% using the ID3 classifier.
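The accuracy measure defined above is simple enough to check by hand. As a small sketch (the function name `accuracy_percent` is hypothetical), the following reproduces two of the reported figures from their raw counts: 63 misclassified emails out of 5000 for the neural network, and 4495 correctly classified emails out of 5000 for the SVM.

```python
def accuracy_percent(correctly_classified, total):
    """Accuracy (%) = correctly_classified_emails / total_emails * 100."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correctly_classified / total * 100


if __name__ == "__main__":
    total_emails = 5000
    # Neural network: 63 of 5000 validation emails were misclassified.
    print(round(accuracy_percent(total_emails - 63, total_emails), 2))
    # SVM: 4495 of 5000 test emails were classified correctly.
    print(round(accuracy_percent(4495, total_emails), 2))
```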
Table 3. Classification accuracy with respect to Data size
According to the experimental study on the testing datasets, the ID3 classifier gave the best classification results.
REFERENCES
[1] S. Appavu alias Balamurugan, R. Rajaram and S. Senthamarai Kannan, 2007. A Novel Data mining approach to detect deceptive communication in email text. Proceedings of National Conference of Advanced Computing, MIT, Chennai, India.
[2] S. Appavu alias Balamurugan, R. Rajaram, et al., 2007. Association rule mining for Suspicious Email Detection: A Data mining approach. Proceedings of IEEE International Conference of Intelligence and Security Informatics, New Jersey, USA.
[3] W. Cohen, et al., 1996. Learning rules that classify email. In Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access.
[4] B. Cui, A. Mondal, J. Shen, G. Cong and K. Tan, 2005. On effective Email classification via Neural networks. In Proceedings of DEXA, pp. 85-94.
[5] Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques.
[6] P. S. Keila and D. B. Skillicorn, 2005. Detecting unusual and Deceptive Communication in Email. Technical report.
[7] S. Kiritchenko, S. Matwin and S. Abu-Hakima, 2004. Email Classification with Temporal Features. Intelligent Information Systems, pp. 523-533.
[8] Seongwook Youn and Dennis McLeod. A Comparative Study for Email Classification.
[9] Y. Yang, et al., 2004. An Evaluation of Statistical Approaches to Text Categorization. Journal of Information Retrieval, Vol. 1, No. 1/2, pp. 67-88.