


ISSN: 2250-3676
Volume 2, Special Issue 1, pp. 56-60

Navya Dhulipala (1), Tejaswi Pinniboyina (2), Radha Rani Deevi (3), Parvathi Devi Karumanchi (4)
(1), (2), (4) C.S.E., KL University, Vaddeswaram, A.P., India
(3) Asst. Prof., C.S.E., KL University, Vaddeswaram, A.P., India

Abstract: This paper presents community Web directories constructed by applying Web usage mining techniques to usage data patterns. User models are created automatically, overcoming the information-overload problem that arises when Web directories are created manually. Filtering employs a cosine-based similarity measure to obtain the efficient user sessions, and categorization follows a hierarchical approach. The paper presents the community directory miner algorithm, which identifies the navigation patterns of Web pages using a specified threshold value and combines thematic information with users' browsing information through machine learning techniques.

Index Terms: web usage mining, clustering techniques, personalization

1. INTRODUCTION
A single Web page can contain a large amount of data on several unrelated topics, which confuses visitors and makes it harder for them to reach the information they are looking for. Much of this information is available online but remains hidden from users, and at present there is no powerful technique that can analyze this hidden information directly. We therefore use Web usage mining (WUM), which analyzes visitor browsing behaviour by means of user models [1]. Traditionally these models rely on manual creation, either by users or by domain experts. As access to Web pages grows rapidly, it becomes necessary to improve the quality of the information provided to users, and navigation between Web pages causes an information-overload problem [2]. To overcome this problem, we apply personalization to Web-based information using Web usage mining. Personalization takes user information into account to better design products and services that adapt to the users. According to Srivastava et al. [3], Web usage mining has not been considered extensively for personalization; it primarily produces data models to be evaluated and exploited by human experts, focusing on statistical usage of data. To evaluate users' browsing behaviour automatically, we use user models, and as the amount of usage data on Web pages grows, we apply personalization to support its dynamic use. Web usage mining, as defined in [4], proceeds in three steps: data preprocessing, pattern discovery, and pattern analysis, where the usage pattern is identified from the log files kept on the Web server of the ISP. In this process we create Web directories from the obtained log files. A Web directory is not a search engine and does not display lists of Web pages based on keywords; instead, it lists Web sites by category and subcategory.
In this process we apply content-based filtering [5], in which systems generate recommendations from pre-constructed user profiles by measuring the similarity of Web content to these profiles. The Web directories are constructed according to the preferences of user communities by combining document clustering and usage mining techniques. A taxonomy is obtained by building a hierarchical tree structure in the form of an artificial Web directory, in which the user data are held constant. In mining the usage data, log files containing the users' session information are obtained from the Web servers, and the Web directories are then created automatically from the user models. We apply the community directory miner algorithm, which addresses the overload

IJESAT | Jan-Feb 2012

Available online @ 56

problem based on the specified threshold value, which determines the classification of older pages.


2. RELATED WORK

Lang (1995) introduced a tool that adaptively constructs user models from a user's browsing behaviour, based on the similarity between Web documents containing news items. Pabarskaite (2002) [6] states that preprocessing of the Web log file plays an important role in WUM and takes 80% of the total time of Web mining; a cleaning technique was presented to remove irrelevant links from the log file, and a filtered Web log is obtained by comparing the raw Web log against a link table. A number of personalized services employ machine learning methods, particularly clustering techniques, to analyze Web usage data and extract knowledge that is useful for recommending links to follow within a site, or for customizing Web sites to the preferences of the users. In preprocessing, data cleaning is performed and Web pages are downloaded, with cookies determining the unique ids of the users. Mobasher et al. (2000a) [3] classify Web personalization techniques into two generic approaches: content-based filtering systems generate recommendations from pre-constructed user profiles by measuring the similarity of Web content to these profiles, whereas collaborative filtering systems make recommendations by using the current user's ratings of objects and referring to other users whose preferences closely resemble the current user's. Yan et al. (1996) [3] identify user sessions by modifying the NCSA httpd server so as to include session identifiers in the Web pages: the first time a Web page is requested from a specific host IP address, an identifier corresponding to the start of a user session is embedded in the page. Hierarchical clustering in the mining process builds a tree of clusters, also known as a dendrogram [18]; every cluster node contains child clusters, which determine the ranking of the parent's Web pages based on the number of leaves.

Pruning is the process that reduces the dynamic representation of graphs in the hierarchical process. To capture the topics of user interest, latent semantic analysis [7] provides a powerful means to capture user access patterns and the associated task space. A collaborative Web recommendation framework was proposed that employs Latent Dirichlet Allocation (LDA) to model the underlying topic-simplex space and to discover associations between user sessions and multiple topics via probabilistic inference. In this way, community Web directories of similar interests are created. In [8], PLSA was used in Web usage mining, combining clustering and probabilistic modeling to identify and characterize user interests inside certain Web sites; the latent factors segmented user sessions to support a personalized recommendation process. In [9], the similarities between users are determined from navigation data. Session information was used to create clusters of ODP categories [10], and these clusters were further exploited to recommend shortcuts within the Web directory.

3. CONSTRUCTION OF COMMUNITY WEB DIRECTORIES

To overcome the overload problem, we construct community Web directories by personalizing Web usage mining. Web usage mining is performed automatically by identifying the user models. User communities are formed from data collected at Web proxies as users browse the Web; the goal is to identify interesting behavioural patterns in the collected usage data and to construct community Web directories based on those patterns. We use three process steps to construct the community directories from the Web logs via the user models.

3.1 Usage Data Preparation

This step comprises the collection and cleaning of the usage data, as well as the identification of user sessions. The usage data are obtained from log files that contain the users' click information. Because a Web log contains a large amount of irrelevant information, the original log file cannot be used directly in the Web usage mining (WUM) procedure [11]; preprocessing of the Web log file is therefore imperative. The main preprocessing tasks are:
(a) Data cleaning: the server log is examined to remove irrelevant items.
(b) User identification: different users are identified, overcoming the difficulties caused by the presence of proxy servers and caches.
(c) Session identification: the page accesses are divided into individual sessions according to the different Web users.

The Web server collects this log-file information based on the navigation patterns of the user sessions. Session identification groups the log data by date and IP address and records each distinct IP address into a separate session.
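The session-identification step above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes log entries have already been cleaned into `(ip, timestamp, url)` tuples, and it uses the 30-minute session timeout mentioned in the paper's experiments. The function name and data layout are illustrative.

```python
from datetime import timedelta

def identify_sessions(log_entries, timeout=timedelta(minutes=30)):
    """Group cleaned log entries (ip, timestamp, url) into sessions.

    A new session starts for an IP address when the gap since its
    previous request exceeds the timeout (30 minutes by default).
    """
    sessions = []
    open_session = {}   # ip -> index of its currently open session
    last_time = {}      # ip -> timestamp of its most recent request
    for ip, ts, url in sorted(log_entries, key=lambda e: e[1]):
        if ip not in open_session or ts - last_time[ip] > timeout:
            sessions.append([url])                  # start a new session
            open_session[ip] = len(sessions) - 1
        else:
            sessions[open_session[ip]].append(url)  # continue the session
        last_time[ip] = ts
    return sessions
```

Each resulting session is simply the ordered list of URLs one user requested within the timeout window; these lists feed the filtering and similarity steps that follow.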


In this preprocessing task we apply a filtering technique that determines a similarity measure for Web sessions, based on the time a user spends on a page and the frequency of visits to each page within the session [12].


Weights are assigned to the documents to construct the community models; using these keyword sets, the Web pages are assigned to a concept hierarchy.

Let P = {p1, p2, ..., pn} be the set of Web pages. The frequency of each page is then computed as the fraction of user sessions in which it appears: Freq(pi) = |{sessions containing pi}| / |all sessions|.
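Under the reading above (the original formula is garbled, so session-level frequency is an assumption), computing the page frequencies over the identified sessions is straightforward:

```python
from collections import Counter

def page_frequencies(sessions):
    """Fraction of user sessions in which each page appears.

    `sessions` is a list of sessions, each a list of page URLs.
    Each page is counted at most once per session (hence `set`),
    matching a session-level frequency definition.
    """
    counts = Counter()
    for session in sessions:
        counts.update(set(session))
    n = len(sessions)
    return {page: c / n for page, c in counts.items()}
```

These frequencies supply the weights used when placing pages into the concept hierarchy.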

In this hierarchical process we use pruning to reduce the leaf nodes, whose large number leads to the overload problem. To overcome this, usage data and thematic information are combined to construct community directories with a specified threshold value.

To find the similarity measure for user sessions we build a distance matrix, where the rows correspond to the sessions and the columns to the URLs, giving an m x n matrix. The similarity between two sessions i and j is computed with the cosine-based similarity measure, denoted sim(i, j), given by sim(i, j) = cos(Si, Sj) = (Si . Sj) / (||Si|| ||Sj||).

By performing this preprocessing stage we obtain the user session information that helps in providing the web directory.

Markov model: A Markov model [13] is used to predict the navigational behaviour across Web pages. It predicts the page most frequently accessed next by the user, by estimating the probability of a transition from page pi to page pj.

3.2 Web Directory Initialization

To personalize the Web directories we use thematic categorization of the Web pages. This categorization reduces the dimensionality and the semantic diversity of the data. The user-session information collected via the similarity measure is assigned to clusters, weighted according to the size of each cluster [14]. Based on these weights, an artificial Web directory is created using a hierarchical agglomerative approach [15], although this approach causes problems in associating the usage data. To classify the Web pages automatically, the sessions are translated into binary feature vectors from which the taxonomy of Web pages is constructed [16]. The community models are identified by building the taxonomy from keyword generation, which first measures the frequency of occurrence of each term in fields such as the title, headings, and plain text.

3.3 Web Directory Creation

The encoded vector-space representation is used to discover patterns of interest in the form of community models. A clustering algorithm associates the Web pages, and the community models are created with the community Web directory miner algorithm [2], [15]. Let A be the set of vertices and B the set of edges.

Step 1: Compute the frequencies of the categories, which correspond to the weights of the vertices.

Fig 1: Process diagram for creation of community web directories using log files.


The weight of vertex i is freq(ci) = (1/N) * sum_j sij, where sij = 1 if the jth session contains a page of category i and 0 otherwise, and N is the number of sessions.

Step 2: Compute co-occurrence frequencies between categories that correspond to the weights of the edges.




Step 3: Introduce a connectivity threshold and remove the edges of the graph whose weights are less than or equal to this value.

Step 4: Turn the weighted graph of categories into an unweighted one by removing all the weights from the nodes and the edges, and find all the maximal cliques.

To obtain the navigation patterns we use the community directory miner algorithm, which evaluates the weights of the edges and vertices; a navigation link between Web pages is removed when its weight is low. To determine the efficiency we use a convergence graph over the threshold values.
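Steps 2-4 of the algorithm can be sketched end to end. This is a minimal illustration under stated assumptions, not the paper's code: edge weights are taken to be co-occurrence frequencies of category pairs across sessions, the threshold prunes weak edges, and maximal cliques of the surviving unweighted graph (found here with a plain Bron-Kerbosch enumeration) serve as candidate community models. Categories left with no edges are omitted.

```python
from collections import defaultdict
from itertools import combinations

def community_models(session_categories, threshold):
    """session_categories: list of sets of category labels, one per session.

    Returns the maximal cliques of the category graph after pruning
    edges whose co-occurrence frequency is <= threshold.
    """
    n = len(session_categories)
    edge_w = defaultdict(float)
    # Step 2: co-occurrence frequency of each category pair (edge weights).
    for cats in session_categories:
        for a, b in combinations(sorted(cats), 2):
            edge_w[(a, b)] += 1.0 / n

    # Step 3: keep only edges strictly above the connectivity threshold.
    adj = defaultdict(set)
    for (a, b), w in edge_w.items():
        if w > threshold:
            adj[a].add(b)
            adj[b].add(a)

    # Step 4: enumerate maximal cliques (Bron-Kerbosch, no pivoting).
    cliques = []
    def bron_kerbosch(r, p, x):
        if not p and not x:
            cliques.append(r)
            return
        for v in list(p):
            bron_kerbosch(r | {v}, p & adj[v], x & adj[v])
            p.remove(v)
            x.add(v)
    vertices = set(adj)
    bron_kerbosch(set(), set(vertices), set())
    return cliques
```

Each returned clique is a set of categories that frequently co-occur in user sessions, i.e. one candidate community model.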


4. EXPERIMENTAL RESULTS

We first perform the preprocessing task by collecting the log-file data; data cleaning is applied to the usage data using a Web crawler. The experiments use the database of a college, containing the access records of 500 users. From this log data we create user sessions with a timeout threshold of 30 minutes. For the similarity measure we build a transition matrix that assigns 1 to the URLs of visited pages and 0 to all others, and we create user sessions from requests that share the same IP address.

5. CONCLUSION

In this paper we apply personalization to Web directories to overcome the problem that occurs during the navigation of Web pages. This personalization is based on Web usage mining of Web log data. A similarity measure applied during filtering determines the efficient Web user sessions, and the community directory miner algorithm applies a threshold to the community directories to prune the thematic categories of Web pages. User models automatically determine the navigational patterns of user interests with the help of machine learning algorithms.

ACKNOWLEDGEMENTS

We would like to express our gratitude to all those who made this paper possible. We thank Mr. K. Satyanarayana, Chancellor of K.L. University, and Dr. K. Raja Sekhara Rao, Dean, K.L. University, for their stimulating suggestions and encouragement. We further thank Prof. S. Venkateswarlu, Dr. K. Subramanyam, and Dr. G. Rama Krishna, who encouraged us to go ahead with this paper.

REFERENCES

[1] Tasawar Hussain, Sohail Asghar, and Nayyer Masood, "Web Usage Mining: A Survey on Preprocessing of Web Log File," 2010.
[2] D. Pierrakos and G. Paliouras, "Personalizing Web Directories with the Aid of Web Usage Data," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 9, Sept. 2010.
[3] D. Pierrakos, G. Paliouras, C. Papatheodorou, and C.D. Spyropoulos, "Web Usage Mining as a Tool for Personalization: A Survey," User Modeling and User-Adapted Interaction, vol. 13, no. 4, pp. 311-372, 2003.
[4] J. Srivastava, R. Cooley, M. Deshpande, and P.-N. Tan, "Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data," SIGKDD Explorations, vol. 1, no. 2, pp. 12-23, 2000.

Fig 2: Graph representing the coverage of the community directories with and without a specific threshold.


This session data is converted into a binary vector representation, and the Web pages are organized into a hierarchical structure by the clustering algorithm. The clusters are formed by the weighted k-means clustering algorithm [17]. We first specify the probability of occurrence of a Web page within a specific threshold value, such as P > 0.5, and then identify the community categories based on the number of clusters formed during cluster mining. In this process five clusters are formed, and a subgraph is created whose vector representation changes according to the user visits.



IJESAT | Jan-Feb 2012

Available online @ 59



[5] J.S. Breese, D. Heckerman, and C. Kadie, "Empirical Analysis of Predictive Algorithms for Collaborative Filtering," Proc. 14th Conf. Uncertainty in Artificial Intelligence (UAI '98), pp. 43-52, 1998.
[6] Tasawar Hussain, Sohail Asghar, and Nayyer Masood, "Web Usage Mining: A Survey on Preprocessing of Web Log File," 2010.
[7] G. Xu, Y. Zhang, and Y. Xun, "Modeling User Behaviour for Web Recommendation Using LDA Model," Proc. IEEE/WIC/ACM Int'l Conf. Web Intelligence and Intelligent Agent Technology.
[8] X. Jin, Y. Zhou, and B. Mobasher, "Web Usage Mining Based on Probabilistic Latent Semantic Analysis," Proc. ACM SIGKDD, pp. 197-205, Aug. 2004.
[9] M. Azimpour-Kivi and R. Azmi, "A Webpage Similarity Measure for Web Sessions Clustering Using Sequence Alignment," IEEE, 2011.
[10] T. Dalamagas, P. Bouros, T. Galanis, M. Eirinaki, and T. Sellis, "Mining User Navigation Patterns for Personalizing Topic Directories," Proc. Ninth Ann. ACM Int'l Workshop on Web Information and Data Management, pp. 81-88, 2007.
[11] Theint Theint Aye, "Web Log Cleaning for Mining of Web Usage Patterns," IEEE, 2011.
[12] D. Vasumathi, A. Govardhan, and K. Suresh, "Effective Web Personalization Using Clustering," IEEE, 2009.
[13] C.-M. Chao, S.-Y. Yang, P.-Z. Chen, and C.-H. Sun, "An Online Web Usage Mining System Using Stochastic Timed Petri Nets," Fourth Int'l Conf. Ubi-Media Computing, IEEE, 2011.
[14] Y. Zhao and G. Karypis, "Evaluation of Hierarchical Clustering Algorithms for Document Datasets," Proc. ACM Int'l Conf. Information and Knowledge Management (CIKM), pp. 515-524, Nov. 2002.
[15] D. Pierrakos, G. Paliouras, C. Papatheodorou, V. Karkaletsis, and M. Dikaiakos, "Web Community Directories: A New Approach to Web Personalization," Web Mining: From Web to Semantic Web, B. Berendt et al., eds., pp. 113-129, Springer, 2004.
[16] C. Christophi, D. Zeinalipour-Yazti, M.D. Dikaiakos, and G. Paliouras, "Automatically Annotating the ODP Web Taxonomy," Proc. 11th Panhellenic Conf. Informatics (PCI '07), 2007.



[17] J.Z. Huang, M.K. Ng, H. Rong, and Z. Li, "Automated Variable Weighting in k-Means Type Clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, pp. 657-668, May 2005.
[18] K. Santhisree, A. Damodaram, S. Appaji, and D. Nagarjuna Devi, "Web Usage Data Clustering Using DBSCAN Algorithm and Set Similarities," IEEE Computer Society, 2010.


Navya Dhulipalla is pursuing her M.Tech in Computer Science Engineering at KL University and completed her B.Tech at JNTUK in 2010.




Tejaswi Pinniboyina is pursuing her M.Tech in Computer Science Engineering at KL University and completed her B.Tech at PVPSIT in 2010.


Deevi Radha Rani is working as an Assistant Professor in the Department of CSE, KL University.


Parvathi Devi Karumanchi is pursuing her M.Tech in Computer Engineering at K.L. University.


