INTRODUCTION
The explosive growth of the Web has drastically changed the way in which
information is managed and accessed. The large scale of Web data sources and the wide
availability of services over the Internet have increased the need for effective Web data
management techniques and mechanisms. Understanding how users navigate over Web
sources is essential for both computing practitioners and researchers. In this context, Web
data clustering has been widely used to increase Web information accessibility,
understand users' navigation behavior, and improve information retrieval and content
delivery on the Web.
The Web confronts the computer user with an enormous flood of information. On almost
any topic one can think of, one can find pieces of information made available by other
Internet citizens, ranging from individual users who post an inventory of their record
collection to major companies that do business over the Web. To cope with this
abundance of available information, users of the WWW need to rely on intelligent tools
that assist them in finding, sorting, and filtering it. Just as data mining aims
at discovering valuable information that is hidden in conventional databases, the
emerging field of Web mining aims at finding and extracting relevant information that is
hidden in Web-related data, in particular in text documents published on the Web.
Like data mining, Web mining is a multi-disciplinary effort that draws techniques
from fields such as information retrieval, statistics, machine learning, and natural
language processing.
The World Wide Web has become increasingly important as a medium for
commerce as well as for the dissemination of information. In E-commerce, companies
want to analyze users' preferences in order to place advertisements, decide their market
strategy, and provide customized guidance to Web customers. In today's information-based
society, Web surfers urgently need to find the information they want among the
overwhelming resources on the Internet. The Web access log contains a great deal of
information that allows us to observe users' interest in a site. Properly exploited, this
information can help us improve the Web site, create a more effective site organization,
and help users navigate through the enormous number of Web documents. Therefore,
data mining, also referred to as knowledge discovery in databases (KDD), has naturally
been introduced to the World Wide Web.
Web usage data collected in the access log is at a very fine granularity: it usually
includes every HTTP request from every user. Each request contains at least the IP
address, the requested page, the request time, the response code, and the size of the item
requested. Therefore, while the access log has the advantage of being extremely detailed,
it also has some drawbacks. When we apply statistical and probabilistic methods to it, we
tend to get results that are more fine-grained than they should be, because the analysis
may focus on micro trends rather than macro trends. Based on our observation, however,
users' browsing behavior on the Web is highly uncertain. Users might browse the same
page for different purposes, spend varying amounts of time on the same page, make a
different number of visits to it, or even reach the page from different sources each time.
Therefore, micro trends tend to be erroneous and not of much use.
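As an illustration of moving from micro to macro trends, fine-grained per-request records can be aggregated into page-level hit counts before any statistical analysis. The toy log below is hypothetical, not the data set used in this work:

```python
from collections import Counter

# Hypothetical sample of fine-grained access-log requests:
# (IP address, requested page, response code, size in bytes)
requests = [
    ("10.0.0.1", "/index.html", 200, 1024),
    ("10.0.0.2", "/index.html", 200, 1024),
    ("10.0.0.1", "/products.html", 200, 2048),
    ("10.0.0.3", "/index.html", 304, 0),
]

# Aggregate per-request records into macro-level page hit counts,
# smoothing out the micro trends of individual requests.
page_hits = Counter(page for _, page, _, _ in requests)
print(page_hits.most_common(1))  # → [('/index.html', 3)]
```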
Depending on the nature of the data, one can distinguish three main areas of
research within the Web mining community:
Web Content Mining: the application of data mining techniques to unstructured or
semi-structured data, usually HTML documents.
Web Structure Mining: the use of the hyperlink structure of the Web as an (additional)
information source.
Web Usage Mining: the analysis of user interactions with a Web server (e.g., clickstream
analysis).
Pattern Discovery: Web usage mining can be used to uncover patterns in server
logs but is often carried out only on samples of the data. The mining process will be
ineffective if the samples are not a good representation of the larger body of data.
Web usage mining can use various data mining or machine learning techniques to
model and understand Web user activity. In one line of work, clustering was used to
segment user sessions into clusters or profiles that can later form the basis for
personalization. In another, the notion of an adaptive Web site was proposed, where the
users' access patterns can be used to automatically synthesize index pages. Other work is
based on using association rule discovery as the basis for modeling Web user activity,
whereas a further approach used probabilistic grammars to model Web navigation
patterns for the purpose of prediction. Yet another approach proposed building data
cubes from Web log data and later applying Online Analytical Processing (OLAP) and
data mining on the cube model. The Web Utilization Miner (WUM) was presented to
discover navigation patterns with user-specified characteristics over an aggregated
materialized view of the Web log, consisting of a trie of sequences of Web views. The
dynamics of Web usage have recently become important, because Web access patterns
on a Web site change due not only to the dynamics of the site's content and structure but
also to changes in users' interests and, thus, their navigation patterns.
In order to create user groups, i.e., user profiles, we only have access to the users'
browsing history. We assume that users with similar browsing patterns up to a point in
time have similar interests and motivations. User profiles can be created through online
web usage mining, which consists of discovering web usage patterns to better understand
the users' behavior. For the task of creating user groups based on the web server access
logs, web usage mining uses clustering techniques. During a visit to a web site, the users'
requests are registered in a web server access log stored on the web server.
Therefore, the web server access logs provide the means to create a data set prepared for
the application of clustering algorithms. Users' interests are affected by the temporal
context; thus, instead of creating user clusters, some research work presents the concept
of session clustering. A session comprises the browsing history of a small time window,
usually 30 minutes. Therefore, by clustering sessions it is easier to comprehend the
contextual motivations of each user and provide ads suited to the user's current interests.
We propose to create user profiles in a two-stage process: first, one creates session
clusters, and then one creates user clusters based on common sessions between users.
This work focuses on the first part, creating session clusters, and compares the resulting
session clusters obtained by using different attributes to describe a session. Our approach
for representing a session consists of combining descriptions extracted from the URLs of
the pages visited with temporal frames based on the date, such as Monday morning.
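The session representation just described can be sketched as follows. The helper names (`temporal_frame`, `session_features`) and the example URLs are hypothetical, but the idea of combining URL-derived tokens with a temporal frame such as Monday morning follows the text:

```python
from datetime import datetime
from urllib.parse import urlparse

def temporal_frame(ts: datetime) -> str:
    """Map a timestamp to a coarse temporal frame such as 'Monday morning'."""
    part = "morning" if ts.hour < 12 else "afternoon" if ts.hour < 18 else "evening"
    return f"{ts.strftime('%A')} {part}"

def session_features(urls, start: datetime) -> set:
    """Describe a session by path tokens from its URLs plus a temporal frame."""
    tokens = set()
    for url in urls:
        tokens.update(t for t in urlparse(url).path.split("/") if t)
    tokens.add(temporal_frame(start))
    return tokens

# Hypothetical session: two pages visited on a Monday morning.
feats = session_features(
    ["http://example.com/sports/football", "http://example.com/sports/results"],
    datetime(2024, 1, 1, 9, 30),  # 2024-01-01 is a Monday
)
print(feats)  # contains 'sports', 'football', 'results', 'Monday morning'
```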
Session Identification
A session is the list of web pages accessed by a given user during a period of time.
Each access is registered as a line of the web server access log. To identify the
list of web pages visited during a user's session, it is necessary to clean all the
information contained in the web server access logs that is meaningless or not relevant.
However, browser and proxy caching represent a major drawback to the creation of a
reliable user session data set. The web server access log is a text file that contains all the
requests made to the web server, usually in the Common Log Format (here extended
with referrer and user-agent fields), so that each entry contains the following fields:
_ IP address or domain name
_ User ID
_ Date and time of the request
_ HTTP request (including method and page requested)
_ Status code response to the request
_ File size
_ Referrer (the web page containing the hyperlink that originated the request)
_ Web agent (user's browser)
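As a sketch, an entry in this format can be parsed with a regular expression. The sample line below is a hypothetical illustration, not an entry from the logs used in this work:

```python
import re

# Regex for a Combined Log Format entry (Common Log Format plus
# the referrer and user-agent fields listed above).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<date>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

# Hypothetical log line for illustration.
line = ('192.168.1.10 - alice [10/Oct/2023:13:55:36 +0000] '
        '"GET /index.html HTTP/1.1" 200 2326 '
        '"http://example.com/start.html" "Mozilla/5.0"')

entry = LOG_PATTERN.match(line).groupdict()
print(entry["host"], entry["status"], entry["referrer"])
```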
The web server access logs used during this work contain accesses to web pages
from several web sites, and in this case the URL of the web page is in the referrer field.
There is also extra information about the request, such as a session cookie and a
long-duration cookie. The session cookie identifies a 30-minute session and the
long-duration cookie identifies a user. Therefore, only web server access log entries
containing the session cookie were considered. From these entries, the web page URL
(referrer), the date, and the session cookie are the meaningful data for the purpose of this
work. Thereafter, these parameters were grouped by common session cookie in order to
create each session representation vector.
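A minimal sketch of this grouping step, with hypothetical pre-parsed entries, might look like this:

```python
from collections import defaultdict

# Hypothetical pre-parsed log entries: (session cookie, date, referrer URL).
entries = [
    ("sess-A", "2023-10-10 13:55", "http://example.com/news"),
    ("sess-B", "2023-10-10 14:01", "http://example.com/sports"),
    ("sess-A", "2023-10-10 13:58", "http://example.com/news/politics"),
]

# Group the meaningful fields (date, URL) by session cookie to form
# one representation vector per session.
sessions = defaultdict(list)
for cookie, date, url in entries:
    sessions[cookie].append((date, url))

print(len(sessions))  # → 2 distinct sessions
```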
Among web usage mining techniques, clustering allows us to group together clients or
data items that have similar characteristics. The information discovered by this technique
is one of the most important types, with a wide range of applications from real-time
personalization to link prediction. It can facilitate the development of future marketing
strategies, such as automated return mail, presenting advertisements to clients falling
within a certain cluster, or dynamically changing a particular site for a client on a return
visit based on the past classification of that client. The key problem lies in how to
effectively discover clusters of Web pages or users with common interests. Clustering
analysis to mine the Web is quite different from traditional clustering due to the inherent
difference between Web usage data clustering and classic clustering. Therefore, there is a
need to develop specialized techniques for clustering analysis based on Web usage data,
and some such approaches have been developed for mining Web access logs.
Session Clustering
After the transformation of user sessions into a multi-dimensional space
as vectors of extracted attributes, clustering algorithms can partition this space into
groups of sessions. Each session within a group is close to the others in the group
according to a distance measure. Regarding the clustering algorithms, both model-based
and similarity-based methods are used to group users or sessions, as well as hierarchical
and partitional techniques. The most common model-based algorithm is the Expectation-
Maximization (EM) algorithm, which has been used to identify associations among users
and pages as well as to provide user profiles.
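As a toy illustration of the model-based (EM) idea, the sketch below fits a one-dimensional two-component Gaussian mixture to made-up session durations; a real implementation would work on the multi-dimensional session vectors:

```python
import math

def em_gmm_1d(data, iters=50):
    """Minimal EM for a 1-D, two-component Gaussian mixture: a toy
    stand-in for model-based clustering, where each cluster is a
    probability distribution."""
    mu = [min(data), max(data)]        # deterministic initial means
    var = [1.0, 1.0]                   # initial variances
    pi = [0.5, 0.5]                    # mixing weights
    k = 2
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            probs = [pi[j] / math.sqrt(2 * math.pi * var[j])
                     * math.exp(-(x - mu[j]) ** 2 / (2 * var[j]))
                     for j in range(k)]
            s = sum(probs)
            resp.append([p / s for p in probs])
        # M-step: re-estimate parameters from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            mu[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var[j] = max(sum(r[j] * (x - mu[j]) ** 2
                             for r, x in zip(resp, data)) / nj, 1e-6)
            pi[j] = nj / len(data)
    return mu

# Two well-separated groups of session durations (minutes, hypothetical).
data = [1.0, 1.2, 0.9, 1.1, 9.8, 10.1, 10.3, 9.9]
means = sorted(em_gmm_1d(data))
print(means)  # approximately [1.05, 10.03]
```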
The proposed methodology is applied to Web users' navigation patterns by a
model-based approach employing:
Cluster validation, i.e., evaluation of the results of a clustering algorithm in a
quantitative and objective manner. We propose a quantitative validation procedure
based on the statistical chi-square (χ²) test. Each cluster is represented by a probability
distribution, and the chi-square metric is used to measure the distances between these
distributions and to test their homogeneity. Since the goal of a clustering procedure is to
discover groups in the data such that each group is significantly different from all the
others, we essentially test the heterogeneity between the clusters in order to assess their
successful discrimination.
Cluster interpretation, i.e., understanding and appropriately interpreting the
meaning of the derived clusters in the wider context of the underlying application, by
using statistical data analysis. Specifically, we propose a visualization approach based on
the statistical method known as correspondence analysis for interpreting
the clustering results. This analysis facilitates revealing similar or related
features in Web users' navigation behavior and their interaction with the content of Web
information sources.
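The chi-square homogeneity test used for cluster validation can be sketched as follows; the two clusters and their page-category visit counts are hypothetical:

```python
# Chi-square test of homogeneity between two clusters, each represented
# by a distribution of page-category visit counts (hypothetical data).
cluster_a = [50, 30, 20]   # visits to categories: news, sports, shop
cluster_b = [20, 35, 45]

def chi_square_stat(obs_a, obs_b):
    """Pearson chi-square statistic for a 2 x k contingency table."""
    total = sum(obs_a) + sum(obs_b)
    stat = 0.0
    for oa, ob in zip(obs_a, obs_b):
        col = oa + ob
        for o, row_total in ((oa, sum(obs_a)), (ob, sum(obs_b))):
            expected = row_total * col / total
            stat += (o - expected) ** 2 / expected
    return stat

stat = chi_square_stat(cluster_a, cluster_b)
# Degrees of freedom = (rows - 1) * (cols - 1) = 2; a statistic well
# above the 5% critical value of 5.99 indicates the two clusters are
# heterogeneous, i.e., successfully discriminated.
print(round(stat, 2))  # → 22.86
```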
Clustering evaluation may be employed under three different views:
1. External view: when results of a clustering method are evaluated on the basis of
a pre-specified structure on a data set, which reflects a user’s intuition about the
clustering structure of this data set.
2. Internal view: clustering results are evaluated in terms of quantities obtained
from the data set itself.
3. Relative view: the clustering result is compared with other clustering schemes,
obtained by modifying only the parameter values.
System Architecture:
[Figure: system architecture. Web user sessions are traced, an initial clustering is
selected, and hierarchical agglomerative clustering produces the final clusters.]
Modules
1. Trace user session details
2. Clustering selection
3. Hierarchical agglomerative
4. Clustering creation
1. Trace User Session Details
In this module we trace user session details. A Web user is identified by the client IP
address and by connections having TCP server port equal to 80 (the HTTP protocol).
Each user trace, i.e., a trace containing only data with a given source IP address, is
preprocessed according to the following steps: i) the data are partitioned day by day;
ii) only working hours of working days are considered; and iii) two consecutive
connections whose opening times are separated by more than half an hour are considered
a priori to belong to two independent data sets.
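A sketch of these preprocessing steps, assuming working hours of 09:00-18:00 (the report does not fix the exact interval) and hypothetical connection times:

```python
from datetime import datetime, timedelta

# Hypothetical connection opening times for one user trace (one IP, port 80).
opens = [
    datetime(2023, 10, 9, 9, 0),    # Monday 09:00
    datetime(2023, 10, 9, 9, 10),
    datetime(2023, 10, 9, 10, 5),   # > 30 min after the previous one
    datetime(2023, 10, 9, 22, 0),   # outside working hours -> dropped
]

# Step ii) keep only working hours (assumed 09:00-18:00) of working days.
kept = [t for t in opens if t.weekday() < 5 and 9 <= t.hour < 18]

# Step iii) split where consecutive opening times are > 30 minutes apart.
gap = timedelta(minutes=30)
sets, current = [], [kept[0]]
for t in kept[1:]:
    if t - current[-1] > gap:
        sets.append(current)
        current = [t]
    else:
        current.append(t)
sets.append(current)

print(len(sets))  # → 2 independent data sets
```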
2. Clustering Selection
In this module we read the user session details and use K-means clustering to produce an
initial clustering of the sessions.
3. Hierarchical Agglomerative:
A partition clustering procedure is run over the original data set, which includes all
samples, using the optimal number of clusters determined so far and the same choice of
cluster representatives adopted in the first step. A fixed number of iterations is run to
obtain a final refinement of the clustering.
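The agglomerative idea behind this module can be sketched as repeatedly merging the two closest clusters until the chosen number remains; the points below are hypothetical one-dimensional session features:

```python
# A small sketch of agglomerative clustering: repeatedly merge the two
# closest clusters (by centroid distance) until k clusters remain.
def agglomerate(points, k):
    clusters = [[p] for p in points]         # start: one cluster per point
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci = sum(clusters[i]) / len(clusters[i])
                cj = sum(clusters[j]) / len(clusters[j])
                d = abs(ci - cj)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)       # merge the closest pair
    return clusters

clusters = agglomerate([1.0, 1.2, 5.0, 5.3, 9.0], k=3)
print([sorted(c) for c in clusters])  # → [[1.0, 1.2], [5.0, 5.3], [9.0]]
```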
LANGUAGE SPECIFICATION
The .NET framework supports several programming languages, among them C#, Visual
Basic, and JavaScript. Code that targets the runtime is known as managed code; although
both managed and unmanaged code can run in the runtime, only managed code benefits
from runtime services such as garbage collection. The Common Language Specification
(CLS) defines the rules under which language interoperability is defined; components
that follow these rules and expose only CLS features can be used from any
CLS-compliant language. The root namespace is System; this contains basic types like
Byte, Double, and Boolean. Alongside Microsoft's old favorites Visual Basic and C++
(as VB.NET and Managed C++), C# joins the family as a new language that has been
designed with the intention of using the .NET libraries as its own. Compilers targeting
the runtime also exist for other programming languages, including:
• FORTRAN
• COBOL
• Eiffel
[Figure: the .NET framework stack, from top to bottom: ASP.NET and Windows Forms;
XML Web Services; Base Class Libraries; Common Language Runtime; Operating
System.]
The runtime supports structured exception handling. The CLS is the set of rules and
constructs that are supported by the CLR (Common Language Runtime). The CLR is the
runtime environment that executes managed code and also makes the development
process easier by providing services. In addition, we can use objects, classes, and
components created in other CLS-compliant languages in the application. Destructors are
used to release the resources allocated to an object before the runtime destroys it. In
C#.NET, the Finalize procedure is available; it is used to complete the tasks that must be
performed before the object is destroyed.
GARBAGE COLLECTION
In C#.NET, the garbage collector checks for the objects that are not currently in
use by applications. When the garbage collector comes across an object that is
marked for garbage collection, it releases the memory occupied by the object.
OVERLOADING
Overloading enables us to define multiple procedures with the same name, where each
procedure has a different set of arguments.