Академический Документы
Профессиональный Документы
Культура Документы
6480(Print), ISSN 0976 6499(Online) Volume 5, Issue 5, May (2014), pp. 91-101 IAEME
91
EXPERIMENTAL EVALUATION OF DIFFERENT CLASSIFICATION
TECHNIQUES FOR WEB PAGE CLASSIFICATION
Ms. Rutu Joshi
1
, Priyank Thakkar
2
Department of Computer Science and Engineering,
Institute of Technology, Nirma University, Ahmedabad
ABSTRACT
Classification of web pages is essential for improving the quality of web search, focused
crawling, development of web directories like Yahoo, ODP etc. This paper compares various
classification techniques for the task of web page classification. The classification techniques
compared include k-Nearest Neighbours (KNN), Naive Bayes (NB), Support Vector Machine
(SVM), Classification and Regression Trees (CART), Random Forest (RF) and Particle Swarm
Optimization (PSO).Impact of using different representations of web pages is also studied. The
different representations of the web pages that are used comprise Boolean, bag-of-words and Term
Frequency and Inverse Document Frequency (TFIDF). Experiments are performed using WebKB
and R8 data sets. Accuracy and F-measure are used as the evaluation measures. Impact of feature
selectionon the accuracy of the classifier is moreover demonstrated.
Keywords: Classification and Regression Trees (CART), K-Nearest Neighbours (KNN), Naive
Bayes (NB), Particle Swarm Optimization (PSO), Random Forest, Support Vector Machine (SVM),
Web Page Classification.
1. INTRODUCTION
The internet consists of millions of web pages corresponding to each and every search word
which provides highly useful information. Search engines help users retrieve web pages related to a
keyword but searching those innumerable pages is tedious. Also, web pages are dynamic and volatile
in nature. There is no unique format for the web pages. Some web pages may be unstructured (text),
some pages may be semi structured (HTML pages) and some pages may be structured (database).
This heterogeneous format on the web presents additional challenges for classification. Hence it is
important for us to find a technique which accurately classifies web pages and provide only the most
INTERNATIONAL JOURNAL OF ADVANCED RESEARCH
IN ENGINEERING AND TECHNOLOGY (IJARET)
ISSN 0976 - 6480 (Print)
ISSN 0976 - 6499 (Online)
Volume 5, Issue 5, May (2014), pp. 91-101
IAEME: www.iaeme.com/ijaret.asp
Journal Impact Factor (2014): 7.8273 (Calculated by GISI)
www.jifactor.com
IJARET
I A E M E
International Journal of Advanced Research in Engineering and Technology (IJARET), ISSN 0976
6480(Print), ISSN 0976 6499(Online) Volume 5, Issue 5, May (2014), pp. 91-101 IAEME
92
relevant web pages. Classification is a data mining technique which predicts pre-defined classes for
data sets. Classification is a supervised learning technique. Here, the classifier is learnt using the
training data set. The trained classifier then assigns class labels to the testing data set. In Web Page
classification, web pages are assigned to pre-defined classes mainly according to their content [1].
The rest of the paper is organized as follows. Section 2 focuses on related work. Section 3
discusses different classifiers for web page classification. Implementation methodology is described
in Section 4. The paper finally ends with conclusions as section 5.
2. RELATED WORK
Classification of web pages using various techniques was extensively studied. Four
classification techniques namely decision trees, k-nearest neighbour, SVM and naive Bayes were
discussed in [4]. The paper focussed on obtaining accurate system results. When decision tree gave
accurate result, Bayesian network did not and vice versa due to their different operational profiles.
Since many methods of web page classification were proposed, no clear conclusion about the best
method was obtained. In [5],the best result was obtained with SVM employing linear kernel function
(followed by method of k-nearest neighbours) and term frequency (TF) document model using
feature selection by mutual information score. Here, special attention was paid while treating short
documents, which frequently occurred on the web.
In [6], authors concentrated on the effects of using context features (text, title and anchor
words) in web page classification using SVM classifiers. Experiment showed that SVM technique
gave very good result on the WebKB data set even using the text components only. Also, the
accuracy of classification improved significantly when context features consisting of title
components and anchor words were used. But elimination of anchor words could not render
consistently good classification results for all the classes of data set. The performance of SVM was
compared using four different kernel functions in [7]. Experimental results showed that out of these
four kernel functions, Analysis of Variance (ANOVA) kernel function yielded the best result.
Thereafter, Latent Semantic Analysis SVM (LSA-SVM), Weighted Vote SVM (WVSVM) and Back
Propagation Neural Network (BPN) were also compared. From the experimental results it was
concluded that WVSVM could classify accurately even with a small data set. Even if the smaller
category had less training data, WVSVM could still classify those web pages with acceptable
accuracy. Whereas in [9], even with a small data set LS-SVM yielded better accuracy with faster
speed and reduced runtime of the algorithm.
PSO produced more accurate classification models than associative classifiers in [10]. PSO
was used for classifying multidimensional real data set in [11], where the parameters were tuned in
such a way that it gave the best result. PSO, KNN, Naive Bayes and Decision Tree classification
techniques were applied on Reuter-21578 and TREC-APdata set and the results were compared in
[12]. The experimental results indicated that PSO yielded much better performance than other
conventional algorithms.
Three different fitness functions were used in [13] on different data sets. PSO was compared
with nine other classification techniques, like Multi-Layer Perceptron Artificial Neural Network
(MLP), Bayes Network, Naive Bayes Tree etc. Here, PSO was in fourth position, quite close to its
predecessors. Also, PSO seemed effective for two class problem but contrasting results were
obtained for more than two classes. Hence, no clear conclusion was inferred.
International Journal of Advanced Research in Engineering and Technology (IJARET), ISSN 0976
6480(Print), ISSN 0976 6499(Online) Volume 5, Issue 5, May (2014), pp. 91-101 IAEME
93
3. CLASSIFICATION METHODS
The following classifiers are used in this paper for the task of web page classification.
3.1 KNN Classifier
K-Nearest Neighbour is a lazy learning method [2]. Here, the training data set is not used to
train the classifier. For a test instance say , the KNN method compares with the training data set to
find the k most similar training instances. It then returns the class which represents the maximum of
the instances of the data set. Normally 1 is not opted for classification due to noise and other
anomalies in the data set. Hence, 3 is chosen forKNN classification in this study.
3.2 Naive Bayes Classifier
Naive Bayes classifier is based on Bayes theorem. Here, classification is considered as
estimating the posterior probabilities of class C for test instance X.
PC|X=
PX|CPC
PX
PX|C=P(x
1
, x
, ,
|
Where,| is the posterior probability of class given attribute, | is the probability
of predictor given class, is the prior probability of class and is the prior probability of
predictor. Here, for each class, probability is calculated. Thereafter, the class which has the highest
probability is assigned to the test instance.
Based on how the web pages are represented, appropriate distribution is fitted to the data.
Gaussian distribution is fitted when web pages are represented by means of TFIDF scores of the
terms. For Boolean representation, multivariate Bernoulli distribution and for bag-of-words
representation, multinomial distribution is fitted to the data.
3.3 Support Vector Machine (SVM)
SVM is one of the most popular classification method. SVM uses supervised learning
technique and can be used for both classification and regression. In general, linear SVMs are used for
binary classification. For more than two classes, SVM network. To build a classifier, SVM finds a
maximum margin hyper plane . Thereafter, an input vector say
is assigned to the
positive class, if
and
and
are
random values in the range 0,1,
International Journal of Advanced Research in Engineering and Technology (IJARET), ISSN 0976
6480(Print), ISSN 0976 6499(Online) Volume 5, Issue 5, May (2014), pp. 91-101 IAEME
97
4.5 Results and Discussions
All the classification methods discussed in section 3 are applied on the three different
representations of web page collection of both the data set. Number of features are varied to see the
impact on the performance of the classifiers.
Figure 4.1: Impact of feature selection on F-measure (WebKB data set, Boolean representation)
Figure 4.2: Impact of feature selection on F-measure (WebKB data set, TFIDF representation)
Figures 4.1 to 4.3 show the impact of feature selection on f-measure for Boolean, TFIDF and
bag-of-words representation respectively for WebKB dataset. Similar results for R8 data set are
depicted in Figures 4.4 to 4.6. It can be seen that a classifier learnt using appropriate number of
International Journal of Advanced Research in Engineering and Technology (IJARET), ISSN 0976
6480(Print), ISSN 0976 6499(Online) Volume 5, Issue 5, May (2014), pp. 91-101 IAEME
98
features improves the performance. It is also evident that LS-SVM is most sensitive to the number of
features.
Figure 4.3: Impact of feature selection on F-measure (WebKB data set, bag-of-words representation)
Figure 4.4: Impact of feature selection on F-measure (R8 data set, Boolean representation)
Figures 4.7 to 4.9 depict the best performance of each of the classification techniques for both
the data set. Number of features when each of the classification techniques has performed the best,
along with the best performance, is also shown in this figures.
International Journal of Advanced Research in Engineering and Technology (IJARET), ISSN 0976
6480(Print), ISSN 0976 6499(Online) Volume 5, Issue 5, May (2014), pp. 91-101 IAEME
99
Figure 4.5: Impact of feature selection on F-measure (R8 data set, TFIDF representation)
Figure 4.6: Impact of feature selection on F-measure (R8 data set, bag-of-words representation)
It can be seen that, random forest achieves the best results for both the data sets. However,
the best results are obtained for bag-of-words representation in case of R8 data set while in case of
WebKB data set, best results are achieved for TFIDF representation.
5. CONCLUSIONS
This paper addresses the task of classifying web pages using various classification
techniques. Performance of KNN, NB, SVM, CART, RF and PSO is compared for different possible
representation of web pages. Among all the methods, Random Forest (RF) gives best overall result.
Results also demonstrate that the performance of the classifier is affected by the representation used.
International Journal of Advanced Research in Engineering and Technology (IJARET), ISSN 0976
6480(Print), ISSN 0976 6499(Online) Volume 5, Issue 5, May (2014), pp. 91-101 IAEME
100
Figure 4.7: Best F-measure (Boolean representation)
Figure 4.8: Best F-measure (TFIDF representation)
Figure 4.9: Best F-measure (Bag-of-words representation)
It can be seen from the results that different classification techniques perform best for
different representation of the web pages. This implies that there is no single representation which
works best for all the classification techniques. One should select the representation based on the
techniques to be used. Impact of feature selection is also studied in the paper and results show that
selecting right number of features definitely improves the performance of the classifier.
International Journal of Advanced Research in Engineering and Technology (IJARET), ISSN 0976
6480(Print), ISSN 0976 6499(Online) Volume 5, Issue 5, May (2014), pp. 91-101 IAEME
101
REFERENCES
[1] Thair Nu Phyu, Survey of Classification Techniques in Data Mining, Proceedings of the
International MultiConference of Engineers and Computer Scientists, vol. I, 2009.
[2] Bing Liu, Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data (Data-Centric
Systems and Applications), Springer-Verlag New York, Inc., Secaucus, NJ, 2006.
[3] R.C.Eberhart, J.Kennedy, A new optimizer using particle swarm theory, Proceedings of the 6th
Symposium MicroMachine and Human Science, IEEE Press, Los Alamitos, CA, October 1995,
pp. 39-43.
[4] M. A. Nayak, A comparative study of web page classification techniques," GIT Journal of
Engineering and Technology, vol. 6, 2013.
[5] J. Materna, Automatic web page classification," 2008.
[6] W.-K. N. Aixin Sun, Ee-Peng Lim, Web classification using support vector machine."
Proceedings of the fourth international workshop on Web information and data management -
WIDM '02, 2002.
[7] R.-C. Chen and C.-H. Hsieh, Web page classification based on a support vector machine using a
weighted vote schema," Expert Systems with Applications, vol. 31, 2006
[8] Breiman, L. and Friedman, J. H. and Olshen, R. A. and Stone, "Classification and Regression
Trees", 1984.
[9] L.-b. X. Yong Zhang, Bin Fan, Web page classification based-on a least square support vector
machine with latent semantic analysis," Fifth International Conference on Fuzzy Systems and
Knowledge Discovery, 2008.
[10] D. Radha Damodaram, Phishing website detection and optimization using particle swarm
optimization technique," 2011.
[11] A. K. J. Sarita Mahapatra and B. Naik, Performance evaluation of pso based classifier for
classification of multidimensional data with variation of pso parameters in knowledge discovery
database," vol. 34, 2011.
[12] D. Z. Ziqiang Wang, Qingzhou Zhang, A pso-based web document classification algorithm,"
Eighth ACIS International Conference on Software Engineering, Artificial Intelligence,
Networking, and Parallel/Distributed Computing (SNPD2007), 2007.
[13] E. T. De Falco, A. Della Cioppa, Facing classification problems with particles warm
optimization," Applied Soft Computing, vol. 7, 2007.
[14] Jiawei Han, Micheline Kamber and Jian Pei, Data mining: concepts and techniques, Morgan
Kaufmann, 2006.
[15] R. Manickam, D. Boominath and V. Bhuvaneswari, An Analysis of Data Mining: Past, Present
and Future, International Journal of Computer Engineering & Technology (IJCET), Volume 3,
Issue 1, 2012, pp. 1 - 9, ISSN Print: 0976 6367, ISSN Online: 0976 6375.
[16] Sandip S. Patil and Asha P. Chaudhari, Classification of Emotions from Text using SVM Based
Opinion Mining, International Journal of Computer Engineering & Technology (IJCET),
Volume 3, Issue 1, 2012, pp. 330 - 338, ISSN Print: 0976 6367, ISSN Online: 0976 6375.
[17] Prof. Sindhu P Menon and Dr. Nagaratna P Hegde, Research on Classification Algorithms and
its Impact on Web Mining, International Journal of Computer Engineering & Technology
(IJCET), Volume 4, Issue 4, 2013, pp. 495 - 504, ISSN Print: 0976 6367, ISSN Online:
0976 6375.
[18] Priyank Thakkar, Samir Kariya and K Kotecha, Web Page Clustering using Cemetery
Organization Behavior of Ants, International Journal of Advanced Research in Engineering
& Technology (IJARET), Volume 5, Issue 1, 2014, pp. 7 - 17, ISSN Print: 0976-6480,
ISSN Online: 0976-6499.
[19] Alamelu Mangai J, Santhosh Kumar V and Sugumaran V, Recent Research in Web Page
Classification A Review, International Journal of Computer Engineering & Technology
(IJCET), Volume 1, Issue 1, 2010, pp. 112 - 122, ISSN Print: 0976 6367, ISSN Online:
0976 6375.