,m different classes
$$\mathrm{ModInfo}(D) = -\sum_{i=1}^{2} S_i \log_3 S_i = -\left(S_1 \log_3 S_1 + S_2 \log_3 S_2\right)$$
where $S_1$ indicates the set of samples that belong to the target class anomaly, and $S_2$ indicates the set of samples that belong to the target class normal.
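The ModInfo measure can be illustrated with a minimal Python sketch. It rests on assumptions not stated in the text: each $S_i$ is read as the fraction of samples in class $i$, the sum carries the usual negative sign of an entropy, and the function name `mod_info` and the toy labels are hypothetical.

```python
import math
from collections import Counter


def mod_info(labels, base=3):
    """Entropy-style measure -sum_i S_i * log_base(S_i).

    Assumption: S_i is the fraction of samples whose label is class i.
    The base defaults to 3, following the logarithm base in the formula.
    """
    n = len(labels)
    if n == 0:
        return 0.0
    total = 0.0
    for count in Counter(labels).values():
        s_i = count / n
        total -= s_i * math.log(s_i, base)
    return total


# Toy data (hypothetical): three anomaly samples and one normal sample.
toy_labels = ["anomaly", "anomaly", "anomaly", "normal"]
print(round(mod_info(toy_labels), 4))
```

A set containing a single class yields 0, and an evenly split two-class set yields $\log_3 2$, the largest value reachable with two classes under this reading.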
The information (entropy) of each attribute $A$ is calculated using
$$\mathrm{Info}_A(D) = \sum_{i=1}^{v} \frac{D_i}{D} \, \mathrm{ModInfo}(D_i)$$
The term $D_i/D$ acts as the weight of the $i$th partition. $\mathrm{Info}_A(D)$ is the expected information required to classify a tuple from $D$ based on the partitioning by $A$.
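The weighted sum over partitions can be sketched as follows. This is an illustrative reading, not the authors' implementation: it assumes $D_i/D$ is the fraction of tuples falling in the $i$th interval of the discretized attribute, it reuses the assumed form of ModInfo (class fractions, negative sign, base 3), and the names `mod_info`, `info_a` and the toy bins are hypothetical.

```python
import math
from collections import Counter, defaultdict


def mod_info(labels, base=3):
    """Assumed form of ModInfo: -sum_i S_i * log_base(S_i) over class fractions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log(c / n, base) for c in Counter(labels).values())


def info_a(bins, labels, base=3):
    """Info_A(D): sum over partitions D_i of (|D_i| / |D|) * ModInfo(D_i).

    `bins` holds the discretized value of attribute A for each tuple,
    `labels` holds the class of each tuple.
    """
    partitions = defaultdict(list)
    for b, y in zip(bins, labels):
        partitions[b].append(y)
    n = len(labels)
    return sum(len(part) / n * mod_info(part, base) for part in partitions.values())


# Toy data (hypothetical): bin "low" is mixed, bin "high" is pure.
toy_bins = ["low", "low", "high", "high"]
toy_labels = ["normal", "anomaly", "anomaly", "anomaly"]
print(round(info_a(toy_bins, toy_labels), 4))
```

A partitioning whose intervals are all pure gives 0, so lower values indicate cut points that separate the classes better.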
IV. EXPERIMENTAL RESULTS
RULE-7 TECHNIQUE:
==================
(word_freq_your = '(0.28698-0.770745]') and (word_freq_money = '(0.02-INF)') and (word_freq_all = '(0.214647-0.615166]') => is_spam=1 (422.0/5.0)
(word_freq_free = '(0.068896-INF)') and (char_freq_! = '(0.107811-INF)') => is_spam=1 (372.0/15.0)
(word_freq_remove = '(0.026225-INF)') and (word_freq_george = '(-INF-0.008661]') => is_spam=1 (440.0/23.0)
(char_freq_$ = '(0.156751-INF)') and (word_freq_000 = '(0.218378-INF)') => is_spam=1 (78.0/3.0)
(char_freq_$ = '(0.156751-INF)') and (word_freq_hp = '(-INF-0.075835]') and (capital_run_length_total = '(0.090418-0.211566]') => is_spam=1 (28.0/2.0)
(word_freq_hp = '(-INF-0.075835]') and (capital_run_length_longest = '(0.041854-0.073868]') and (word_freq_edu = '(-INF-0.047378]') and (word_freq_george = '(-INF-0.008661]') and (capital_run_length_total = '(0.066714-0.090418]') and (char_freq_$ = '(0.156751-INF)') => is_spam=1 (31.0/0.0)
(char_freq_! = '(0.107811-INF)') and (capital_run_length_average = '(0.058836-INF)') => is_spam=1 (45.0/3.0)
(word_freq_hp = '(-INF-0.075835]') and (word_freq_internet = '(0.036215-INF)') and (word_freq_edu = '(-INF-0.047378]') and (word_freq_order = '(0.092351-INF)') => is_spam=1 (33.0/0.0)
(word_freq_hp = '(-INF-0.075835]') and (capital_run_length_average = '(0.046493-0.058836]') and (word_freq_george = '(-INF-0.008661]') and (word_freq_edu = '(-INF-0.047378]') and (capital_run_length_longest = '(0.02916-0.041854]') => is_spam=1 (35.0/5.0)
(word_freq_hp = '(-INF-0.075835]') and (capital_run_length_longest = '(0.041854-0.073868]') and (char_freq_! = '(0.107811-INF)') => is_spam=1 (31.0/2.0)
(word_freq_hp = '(-INF-0.075835]') and (word_freq_free = '(0.068896-INF)') and (word_freq_re = '(-INF-0.026082]') and (capital_run_length_longest = '(0.041854-0.073868]') and (capital_run_length_average = '(0.030341-0.046493]') => is_spam=1 (21.0/2.0)
(word_freq_hp = '(-INF-0.075835]') and (word_freq_our = '(0.185737-INF)') and (word_freq_your = '(0.28698-0.770745]') and (word_freq_george = '(-INF-0.008661]') => is_spam=1 (87.0/23.0)
(word_freq_hp = '(-INF-0.075835]') and (capital_run_length_longest = '(0.02916-0.041854]') and (word_freq_edu = '(-INF-0.047378]') and (char_freq_( = '(-INF-0.010126]') and (char_freq_$ = '(0.156751-INF)') => is_spam=1 (11.0/0.0)
(word_freq_hp = '(-INF-0.075835]') and (char_freq_$ = '(0.096152-0.156751]') and (char_freq_! = '(0.049475-0.107811]') => is_spam=1 (33.0/4.0)
(word_freq_hp = '(-INF-0.075835]') and (word_freq_george = '(-INF-0.008661]') and (word_freq_edu = '(-INF-0.047378]') and (capital_run_length_longest = '(0.041854-0.073868]') and (char_freq_( = '(0.010126-0.106447]') and (capital_run_length_average = '(0.030341-0.046493]') => is_spam=1 (11.0/0.0)
(word_freq_hp = '(-INF-0.075835]') and (capital_run_length_longest = '(0.02916-0.041854]') and (word_freq_edu = '(-INF-0.047378]') and (word_freq_over = '(0.212283-INF)') and (word_freq_pm = '(-INF-0.101716]') and (word_freq_all = '(-INF-0.214647]') => is_spam=1 (18.0/2.0)
(word_freq_hp = '(-INF-0.075835]') and (word_freq_george = '(-INF-0.008661]') and (word_freq_edu = '(-INF-0.047378]') and (char_freq_! = '(0.049475-0.107811]') and (word_freq_mail = '(0.049675-0.327926]') and (word_freq_credit = '(0.064194-INF)') => is_spam=1 (7.0/0.0)
(word_freq_hp = '(-INF-0.075835]') and (word_freq_george = '(-INF-0.008661]') and (word_freq_free = '(0.068896-INF)') and (word_freq_edu = '(-INF-0.047378]') and (char_freq_$ = '(0.045623-0.096152]') => is_spam=1 (8.0/1.0)
(word_freq_hp = '(-INF-0.075835]') and (word_freq_george = '(-INF-0.008661]') and (capital_run_length_longest = '(0.041854-0.073868]') and (word_freq_650 = '(0.023453-INF)') and (word_freq_internet = '(-INF-0.036215]') => is_spam=1 (15.0/1.0)
(word_freq_hp = '(-INF-0.075835]') and (word_freq_george = '(-INF-0.008661]') and (word_freq_business = '(0.362835-INF)') => is_spam=1 (18.0/5.0)
(word_freq_george = '(-INF-0.008661]') and (word_freq_hp = '(-INF-0.075835]') and (word_freq_re = '(-INF-0.026082]') and (capital_run_length_average = '(0.058836-INF)') and (word_freq_our = '(0.022361-0.185737]') => is_spam=1 (7.0/0.0)
International Journal of Computer Trends and Technology (IJCTT) volume 5 number 5 Nov 2013
ISSN: 2231-2803 http://www.ijcttjournal.org Page 261
(word_freq_hp = '(-INF-0.075835]') and (word_freq_george = '(-INF-0.008661]') and (word_freq_re = '(-INF-0.026082]') and (word_freq_font = '(0.081988-INF)') and (char_freq_; = '(-INF-0.128582]') => is_spam=1 (14.0/1.0)
(word_freq_george = '(-INF-0.008661]') and (word_freq_hp = '(-INF-0.075835]') and (word_freq_re = '(-INF-0.026082]') and (char_freq_! = '(0.107811-INF)') and (word_freq_will = '(-INF-0.159165]') and (capital_run_length_longest = '(0.02916-0.041854]') and (word_freq_meeting = '(-INF-0.178499]') => is_spam=1 (13.0/1.0)
(word_freq_free = '(0.068896-INF)') and (char_freq_( = '(-INF-0.010126]') and (capital_run_length_average = '(0.058836-INF)') and (char_freq_! = '(0.049475-0.107811]') => is_spam=1 (5.0/0.0)
(word_freq_hp = '(-INF-0.075835]') and (word_freq_george = '(-INF-0.008661]') and (word_freq_edu = '(-INF-0.047378]') and (word_freq_your = '(0.28698-0.770745]') and (word_freq_business = '(0.095342-0.362835]') => is_spam=1 (7.0/1.0)
 => is_spam=0 (2811.0/122.0)
Number of Rules : 26
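
The rule list can be read as an ordered decision list: the first rule whose conditions all hold decides the class, and the final rule with no conditions is the default. The sketch below parses a few of the rules above and applies them to an already discretized record. It assumes that the trailing pair, such as (372.0/15.0), gives the number of instances covered by the rule followed by the number of those misclassified, as in Weka-style rule output; the text does not state this. The helper names `parse_rule`, `precision` and `classify` are hypothetical.

```python
import re

# One rule per line: "<conditions> => is_spam=<0|1> (<covered>/<errors>)".
RULE_RE = re.compile(
    r"^(?P<body>.*?)\s*=>\s*is_spam=(?P<label>[01])\s*"
    r"\((?P<covered>[\d.]+)/(?P<errors>[\d.]+)\)$"
)
# One condition: (attribute = 'interval').
COND_RE = re.compile(r"\((\S+)\s*=\s*'([^']*)'\)")


def parse_rule(line):
    """Split a rule line into its conditions, predicted label and counts."""
    m = RULE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a rule line: {line!r}")
    return {
        "conditions": dict(COND_RE.findall(m.group("body"))),
        "label": int(m.group("label")),
        "covered": float(m.group("covered")),
        "errors": float(m.group("errors")),
    }


def precision(rule):
    """Fraction of covered instances classified correctly (assumed meaning)."""
    return (rule["covered"] - rule["errors"]) / rule["covered"]


def classify(rules, record):
    """Return the label of the first rule whose conditions all match."""
    for rule in rules:
        if all(record.get(a) == v for a, v in rule["conditions"].items()):
            return rule["label"]
    return None


# Three rules copied from the listing; the last one is the default rule.
RULE_LINES = [
    "(word_freq_free = '(0.068896-INF)') and (char_freq_! = '(0.107811-INF)') => is_spam=1 (372.0/15.0)",
    "(char_freq_$ = '(0.156751-INF)') and (word_freq_000 = '(0.218378-INF)') => is_spam=1 (78.0/3.0)",
    " => is_spam=0 (2811.0/122.0)",
]
rules = [parse_rule(line) for line in RULE_LINES]

# Hypothetical discretized record: attribute name -> interval label.
record = {"word_freq_free": "(0.068896-INF)", "char_freq_!": "(0.107811-INF)"}
print(classify(rules, record), round(precision(rules[0]), 3))
```

Because the default rule has no conditions, any record that matches none of the spam rules is labeled is_spam=0.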
V. CONCLUSION AND FUTURE SCOPE
Discretization of continuous features plays an important role in data pre-processing. This paper briefly introduced the discretization problem and its benefits, which include improving the efficiency of algorithms and expanding their application scope. The existing literature has drawbacks in how it classifies discretization methods. The ideas and drawbacks of some typical methods were described in detail, grouped into supervised and unsupervised categories. The proposed improved discretization approach significantly reduces the I/O cost and requires only a one-time sort of the numerical attributes, which leads to better running time for rule mining algorithms. According to the experimental results, our algorithm requires less execution time than the entropy-based algorithm and can be adopted with any attribute selection method, which improves the accuracy of rule mining.