
MS Thesis (Presentation)

Comparative Study on Feature Space Reduction Techniques for Spam Detection

• Researchers: Nouman Azam, Dr. Amir Hanif Dar
• Samiullah Marwat
The Problem
• Email is the most widely used communication medium worldwide
  – It is cheap, reliable, fast, and easily accessible.
• Why is it prone to spam?
  – Because of its wide usage and low cost.
  – With a single click you can communicate with anyone, anywhere around the
    globe.
  – It hardly costs spammers any more to send out 1 million emails than to
    send 10.
Statistics of Spam
• Is spam really a problem? Some statistics will clarify.
• At the end of 2002, as much as 40% of all email traffic consisted of spam.
  (http://zdnet.com.com/2100-1106-955842.html)
• In 2003, the percentage was estimated at about 50% of all emails.
  (http://zdnet.com.com/2100-1105_2-1019528.html)
• In 2006, BBC News reported 96% of all emails to be spam.
  (http://news.bbc.co.uk/2/hi/technology/5219554.stm)
Statistics of Spam
Daily spam emails sent                            12.4 billion
Daily spam received per person                    6
Annual spam received per person                   2,200
Spam cost to all non-corporate Internet users     $255 million
Spam cost to all U.S. corporations in 2002        $8.9 billion
Email address changes due to spam                 16%
Annual spam in a 1,000-employee company           2.1 million
Users who reply to spam email                     28%

http://spam-filter-review.toptenreviews.com/spam-statistics.html
Statistics of Spam
[Chart: spam statistics from http://www.junk-o-meter.com/stats/index.php]
Problems from Spam
• Wastage of network resources
  – Bandwidth is consumed needlessly.
• Wastage of time
  – It wastes the time of people working in organizations, reducing
    productivity.
• Damage to PCs
  – Computer viruses carried by spam can cause serious damage to PCs.
• Ethical issues
  – Spam emails advertising pornographic sites can cause problems for
    children.
Definition of Spam
• Unsolicited (unwanted) email for a recipient,
OR
• Any email that the user does not want to have in his inbox.
Existing Approaches
• Rule based
  – Hand-made rules for detection of spam, written by experts. (Needs domain
    experts and constant updating of rules.)
• Customer revolt
  – Forcing companies not to publicize personal email IDs given to them.
    (Hard to implement.)
• Domain filters
  – Allowing mail from specific domains only. (Hard job of keeping track of
    the domains that are valid for a user.)
• Blacklisting
  – Blacklist filters use databases of known abusers, and can also filter
    unknown addresses. (Constant updating of the databases would be required.)

http://www.templetons.com/brad/spam/spamsol.html
Existing Approaches
• Whitelist filters
  – Mailer programs learn all contacts of a user and let mail from those
    contacts through directly. (Everyone would first need to communicate his
    email ID to the user before he could send email.)
• Hiding the address
  – Hiding one's original address from spammers by receiving all email at a
    temporary email ID, which is then forwarded to the original address if
    found valid by the user. (Hard job of maintaining a couple of email IDs.)
• Checks on the number of recipients by email agent programs.
• Government actions
  – Laws implemented by governments against spammers. (Hard to enforce.)
Lastly
• Automated recognition of spam
  – Uses machine learning algorithms that first learn from the past data
    available. (Seems to be the best approach at present.)

http://www.templetons.com/brad/spam/spamsol.html
Why Automated Spam Detection Is Best
• Minimum user input
  – The filter will filter spam automatically with minimum user input.
• Adaptation to new kinds of spam
  – The filter can adapt itself to newly appearing kinds of spam, i.e. it
    will learn and update itself automatically.
Nature of the Problem
• An instance of document classification
  – It can be considered a simple instance of the document classification
    problem: we have two classes, and our objective is to separate spam from
    legitimate emails.
• The features in our domain are words.
• Representation of emails
  – Any email can be represented in terms of features (taken to be words
    here) with discrete values based on some statistic of the presence or
    absence of words.
Main Steps
Preprocessing of Data
• Removal of words shorter than 3 characters
  – All words of length less than 3 were removed, as they were found to be
    mostly non-informative.
• Removal of stop words
  – Stop words are those which provide the structure of the language but not
    the content.
  – They are not informative about the class of the document.
  – Examples are pronouns and conjunctions.
• Stemming with the Porter stemming algorithm (Porter 1980)
  – Stemming reduces words having the same stem to a single word, thus
    reducing the vocabulary.
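A rough sketch of these three preprocessing steps, with toy inputs (not the thesis code — a real run would use the Porter stemmer, e.g. NLTK's PorterStemmer; here a crude suffix-stripper stands in, and the stop-word list is a tiny sample):

```python
# Assumed pipeline: drop short words, drop stop words, then stem.
STOP_WORDS = {"the", "and", "is", "of", "to", "a", "in", "it", "are"}  # sample only

def crude_stem(word):
    # Stand-in for the Porter algorithm: strip a few common suffixes.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = text.lower().split()
    tokens = [t for t in tokens if len(t) >= 3]          # words shorter than 3 removed
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop words removed
    return [crude_stem(t) for t in tokens]               # stem what remains

print(preprocess("Winning prizes is easy and the winners are waiting"))
# → ['winn', 'priz', 'easy', 'winner', 'wait']
```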
Preprocessing
[Slide shows some example stop words, examples of stemmed words, and the
Ling Spam corpus after the preprocessing.]
Representation of Data
[Table: rows are feature #1, feature #2, feature #3, feature #4, …; columns
are example #1, example #2, example #3, example #4, …; each cell holds the
feature's value in that example.]
Representation of Data
• Term Frequency (TF)

  w_ij = tf_ij

  where w_ij is the weight of term i in email j, and tf_ij is the frequency
  of term i in email j.
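A minimal sketch of this features-by-examples representation with toy emails (illustrative vocabulary, not the Ling Spam corpus):

```python
# Each email becomes a column vector whose i-th entry is tf_ij, the count
# of term i in email j, matching the table layout on the earlier slide.
def tf_matrix(emails, vocabulary):
    # Rows = features (terms), columns = emails.
    return [[email.count(term) for email in emails] for term in vocabulary]

emails = [["free", "money", "free"], ["meeting", "money"]]
vocab = ["free", "money", "meeting"]
print(tf_matrix(emails, vocab))  # → [[2, 0], [1, 1], [0, 1]]
```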
The Corpus (Data Set)
• The corpus used in our experimentation was the Ling Spam corpus
  (Androutsopoulos et al. 00).
• Total number of legitimate emails in the corpus: 2412.
• Total number of spam emails: 481.
• Spam percentage: about 16%.
Why Feature Reduction?
• Text classification tasks are marked by high dimensionality
  – The total number of unique features, i.e. words, in the entire corpus
    was found to be over 40 thousand.
• A hard job
  – Computation in over 40 thousand dimensions would be very difficult.
  – Secondly, the storage requirements would be huge.
  – It should be noted that storing an array of 2890 × 40,000 requires
    approximately over 220 MB in CSV format.
Feature Reduction Methods
• Mutual Information (MI)
• Latent Semantic Indexing (LSI, PCA or KLT)
• Word Frequency Thresholding (TF)
Mutual Information
• A supervised feature selection method.
• MI for feature t can be calculated as

  MI(t, c) = Σ_{c ∈ {Spam, Leg}} Σ_{t ∈ {0, 1}} P(t, c) · log( P(t, c) / (P(t) · P(c)) )

  where t = term (feature) and c = class.
• MI scores for all of the terms (features) were calculated, the features
  were sorted in descending order of score, and the top-scoring features
  were selected. (Sahami et al. 98)
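The MI score above can be sketched as follows, on toy labelled documents (not the thesis data; zero-probability cells are skipped, as their contribution tends to zero):

```python
import math

# MI(t, c) = sum over c in {spam, legit} and t in {0, 1} of
#            P(t, c) * log(P(t, c) / (P(t) * P(c)))
def mutual_information(docs, labels, term):
    n = len(docs)
    score = 0.0
    for c in set(labels):
        for t in (0, 1):
            # P(t, c): fraction of docs of class c where the term is
            # present (t = 1) or absent (t = 0).
            joint = sum(1 for d, l in zip(docs, labels)
                        if l == c and (term in d) == bool(t)) / n
            p_t = sum(1 for d in docs if (term in d) == bool(t)) / n
            p_c = labels.count(c) / n
            if joint > 0:
                score += joint * math.log(joint / (p_t * p_c))
    return score

docs = [{"free", "cash"}, {"free"}, {"meeting"}, {"report"}]
labels = ["spam", "spam", "legit", "legit"]
# "free" occurs in exactly the spam emails, so it scores higher than "cash".
print(mutual_information(docs, labels, "free") >
      mutual_information(docs, labels, "cash"))  # → True
```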
Term Frequency Thresholding
• An unsupervised feature selection method.
• Term frequency
  – The TF of a feature in a document is the number of times it appears in
    that document.
• Term frequency score of a feature
  – The TF score of a feature is the sum of its individual term frequencies
    over the entire set of documents (emails).
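This scoring can be sketched in a few lines, with toy documents in place of the corpus:

```python
from collections import Counter

# TF score of a feature = sum of its per-document frequencies over the
# whole collection; the top-N scorers are kept as the feature set.
def top_features_by_tf(docs, n):
    totals = Counter()
    for doc in docs:
        totals.update(doc)  # adds this document's term counts
    return [term for term, _ in totals.most_common(n)]

docs = [["free", "free", "money"], ["money", "meeting"], ["free"]]
print(top_features_by_tf(docs, 2))  # → ['free', 'money']
```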
Latent Semantic Indexing
• An unsupervised feature extraction method.
• Also known as Principal Component Analysis (PCA) and the Karhunen-Loève
  transform.
• It calculates the eigenvectors EV of the covariance matrix C, which is
  obtained by multiplying the mean-adjusted data µ with its transpose.
• The eigenvectors corresponding to the top eigenvalues are selected.
• The transformed data TD is obtained by taking the transpose of the
  eigenvector matrix and multiplying it with the mean-adjusted data, i.e.

  TD = EV' * µ

(Günal et al. 05)
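The steps above (mean-adjust, covariance, top eigenvectors, project) can be sketched with NumPy on random toy data in place of the email matrix:

```python
import numpy as np

def pca_transform(data, k):
    # data: features x examples. Mean-adjust each feature (row).
    mu = data - data.mean(axis=1, keepdims=True)
    cov = mu @ mu.T / (mu.shape[1] - 1)              # covariance matrix C
    eigvals, eigvecs = np.linalg.eigh(cov)           # ascending eigenvalues
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # top-k eigenvectors EV
    return top.T @ mu                                # TD = EV' * mu

rng = np.random.default_rng(0)
data = rng.normal(size=(5, 40))   # 5 features, 40 examples
td = pca_transform(data, 2)
print(td.shape)  # → (2, 40)
```

The reduced data has one row per retained eigenvector, so 5 original dimensions shrink to 2 here.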
The Classifier
• The classifier used was k-nearest neighbor.
• All the training data were stored in memory.
• Classification of a new example is carried out by finding its Euclidean
  distance from all the stored data; the class of the nearest examples
  becomes the class of the new data.
(Androutsopoulos et al. 00)
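A minimal sketch of this k-nearest-neighbor rule on toy 2-D points (squared Euclidean distance gives the same ordering as Euclidean, so the square root is skipped):

```python
from collections import Counter

def knn_classify(train_x, train_y, query, k=1):
    # Sort all stored examples by (squared) Euclidean distance to the query.
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), y)
        for x, y in zip(train_x, train_y)
    )
    # Majority vote among the k nearest neighbors.
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

train_x = [(0, 0), (0, 1), (5, 5), (6, 5)]
train_y = ["legit", "legit", "spam", "spam"]
print(knn_classify(train_x, train_y, (5, 6), k=3))  # → spam
```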
Experimental Settings
• In the first set, the data was represented using term frequency.
• Three algorithms were tested: MI, LSI and TF thresholding.
• All three algorithms were used to select the top 20, 50, 100 and 250
  features.
Evaluation Measures
• Accuracy
  – Let N_Spam and N_Leg be the total numbers of spam and legitimate emails
    in our data set.
  – Let N_{Y→Z} be the number of emails that belong to class Y but are
    classified as Z. Then

    Acc = (N_{Spam→Spam} + N_{Leg→Leg}) / (N_Spam + N_Leg)
    Err = (N_{Leg→Spam} + N_{Spam→Leg}) / (N_Spam + N_Leg)

  – Identifying a legitimate email as spam is more costly than identifying
    spam as legitimate. To cope with this cost difference, we redefine
    accuracy as weighted accuracy and error as weighted error:

    WAcc = (λ·N_{Leg→Leg} + N_{Spam→Spam}) / (λ·N_Leg + N_Spam)
    WErr = (λ·N_{Leg→Spam} + N_{Spam→Leg}) / (λ·N_Leg + N_Spam)
Evaluation Measures
• Spam Recall
  – If we consider the identification of spam as a filtering process,
    filtering all identified spam out of the legitimate emails, then:
  – Spam recall measures the percentage of spam messages that the filter
    manages to block:

    SR = N_{Spam→Spam} / N_Spam

• Spam Precision
  – Measures the degree to which the blocked messages are indeed spam:

    SP = N_{Spam→Spam} / (N_{Spam→Spam} + N_{Leg→Spam})

(Androutsopoulos et al. 00; Sahami et al. 98)
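The two measures reduce to a pair of one-line ratios; the counts below are made up for illustration:

```python
def spam_recall(n_ss, n_spam):
    # Share of all spam that the filter blocks.
    return n_ss / n_spam

def spam_precision(n_ss, n_ls):
    # Share of blocked messages that are indeed spam.
    return n_ss / (n_ss + n_ls)

# Hypothetical: 430 of 480 spam blocked, 10 legitimate emails wrongly blocked
print(round(spam_recall(430, 480), 3), round(spam_precision(430, 10), 3))
# → 0.896 0.977
```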
Experimental Results (1)

Weighted accuracy (%) results with k = 1

No. of    LSI (PCA)            Thresholding         MI (entire data set)  MI (individual files)
features  λ=9   λ=99  λ=999    λ=9   λ=99  λ=999    λ=9   λ=99  λ=999     λ=9   λ=99  λ=999
20        94    94.1  94.1     94.5  94.8  94.9     97.5  98    98        96.4  96.9  96.9
50        92.7  92.8  92.8     94.7  94.9  94.9     97.5  98    98        95.4  95.9  95.9
100       90.7  90.7  90.7     93.2  93.3  93.4     96.4  96.8  96.8      93.9  94.3  94.4
250       88.5  88.5  88.5     90.4  90.5  90.6     93.7  94.1  94.1      92.8  93.1  93.1
500       -     -     -        91    91.1  91.1     -     -     -         -     -     -
Experimental Results (1)
No Thresh MI(individual
Featu
re LSI (PCA) holding MI(Entire data) File)
WAC WAC WAC WAC WAC WAC WAC WAC WAC WAC WAC WAC
λ=9 λ λ=999 λ=9 λ λ=999 λ=9 λ λ=999 λ=9 λ=99 λ=999
(%) = (%) (%) = (%) (%) = (%) (%) (%) (%)
9 9 9
20 92.9 92.99 92.9 94.3 94.69 94.6 97.5 989 98 96 96.5 96.5
(%) (%) (%)
50 91.3 91.3 91.3 93.2 93.4 93.4 97.8 98.3 98.3 95.5 95.9 96

100 89 89 89 91.9 92.1 92.1 96.3 96.7 96.8 93.8 94.2 94.2

250 88.5 88.6 88.6 90.8 91 91 95.1 95.6 95.6 92.7 93.1 93.2

500 - - - 90 90.1 90.1 - - - - - -

Weighted Accruacy Results with k = 3


Experimental Results (1)
[Figure: Spam recall values for k = 1 — spam recall (%) on the y-axis
(70–95) vs. number of features (0–300) for LSI, Thresholding, MI (entire
data) and MI (individual file).]
Experimental Results (1)
[Figure: Spam recall values for k = 3 — spam recall (%) on the y-axis
(70–95) vs. number of features (0–300) for LSI, Thresholding, MI (entire
data) and MI (individual file).]
Experimental Results (1)
[Figure: Spam precision values for k = 1 — spam precision (%) on the y-axis
(70–95) vs. number of features (0–300) for LSI, Thresholding, MI (entire
data) and MI (individual file).]
Experimental Results (1)
[Figure: Spam precision values for k = 3 — spam precision (%) on the y-axis
(70–95) vs. number of features (0–300) for LSI, Thresholding, MI (entire
data) and MI (individual file).]
Summary of Results from Experiment
• MI performs well on accuracy.
• MI scores calculated over the entire data set perform better than MI
  scores calculated on the individual files.
• LSI and TF thresholding perform well on spam recall but are outperformed
  by MI on spam precision.
• LSI and TF thresholding have similar results.
Observations
• Changing the value of k for the nearest neighbor
  – does not have a significant impact on the results.
• Value of k
  – Any value from 1 to 7 gives approximately the same results.
• Feature set size and accuracy
  – There isn't any consistent relationship.
• Changing the value of λ
  – Increasing λ from 9 to 999 (in the weighted accuracy equation) improves
    the accuracy by 0.5% to 1.5% on average.
• The best accuracy results
  – were obtained with the smaller feature sets, which is a great
    improvement over the original feature space of over 40 thousand features.
Future Work
• Minimum feature set size
  – I was unable to find the minimum feature set size below which
    performance starts degrading.
• Other features of email
  – The corpus I used does not have other features of emails, such as
    attachments, pictures, domain properties, etc. Adding these as features
    should have a good impact on accuracy; this has been examined in
    (Sahami et al. 97).
• Spam rate of the corpus
  – The spam rate of the corpus was about 16%, which should be higher.
    Increasing the spam rate to 70% or 80% might improve performance in
    terms of spam recall and precision, and would better depict the current
    spam rate.
