
1. INTRODUCTION
A Web user-session, simply called a user-session in the remainder of this paper, is informally defined as the set of TCP connections created by a given user while surfing the Web during a given time frame. The study of telecommunication networks has often been based on traffic measurements, which are used to create traffic models and obtain performance estimates. While a lot of attention has traditionally been devoted to traffic characterization at the packet and transport layers, few studies address traffic properties at the session/user layer.
This is due to the difficulty of defining the session concept itself, which depends on the considered application. Also, the generally accepted conjecture that such sessions follow a Poisson arrival process might have reduced interest in the analysis of the user-session process.
User-session identification and characterization play an important role both in Internet traffic modeling and in the proper dimensioning of network resources. Besides increasing the knowledge of network traffic and user behavior, they yield workload models which may be exploited both for performance evaluation and for the dimensioning of network elements; synthetic workload generators may be defined on top of such models to assess network performance.
Finally, knowledge of user-session behavior is important for service providers, for example to dimension access links and router capacity. User behavior can be modeled by a few parameters, e.g., session arrival rates and data volumes.
From the user's perspective, an activity (ON) period on the Web alternates with a silent (OFF) period during which the user is inactive on the Internet.

Clustering technique:
Clustering techniques are exploratory techniques used in many areas to analyze large data sets. Given a proper notion of similarity, they find groups of similar variables/objects by partitioning the data set into subsets of similar samples.
The HTTP protocols have a strong influence on the characteristics of web traffic in the Internet. For example, measurements of TCP connection usage for early versions of the HTTP protocols pointed to clear inefficiencies in design, notably the creation of a different TCP connection for each web object referenced. Recent revisions to the HTTP protocol, notably version 1.1, have introduced the concepts of persistent connections and pipelining.
Persistent connections enable the reuse of a single TCP connection for multiple object references at the same IP address (typically embedded components of a web page).
Pipelining allows the client to make a series of requests on a persistent connection without waiting for a response between each request. Based on these mechanisms, we make a number of observations about the evolving nature of web traffic, the use of the web, the structure of web pages, the organization of the web servers providing this content, and the use of packet tracing as a method for understanding the dynamics of web traffic.
Packet: a chunk of information with atomic characteristics, that is, it is either delivered correctly to the destination or is lost (and most often has to be retransmitted);
Flow: a concatenation of correlated packets, as in a TCP connection.
Traffic characteristics:
Traffic characteristics, or traffic behavior, can be measured at the following levels:
1) the packet level
2) the bandwidth level
The packet level:

Measuring packet-level characteristics is complex, since a substantial body of theory is needed to develop traffic engineering tools.
The packet size distribution, averaged over one week in January 2001, confirms that packets are either very small, i.e., pure control packets, or very large. Another useful statistic is the distribution of the Time To Live (TTL) field content, distinguishing between incoming and outgoing packets.
Flows:
Flows are defined as streams of packets, each corresponding to the transfer of a document. The document may be a web page or a video clip, and may require a number of TCP connections. Generally, flows are small, but most of the traffic volume is due to a small number of very large flows. Flows do not occur independently.
Sessions:
A session is a set of TCP connections generated by a given user, and every session can contain any number of flows. Since a session can carry much more traffic than a single flow, the session arrival process largely determines the workload offered to servers. An important measure in traffic analysis is the variability of both the number of flows in a session and the sizes of those flows.

1) A user session begins when a user who has been idle opens a TCP connection.
2) A user session ends, and the next idle period begins, when the user has had no open connections for r consecutive seconds.

In choosing a particular value of r, we wanted to use a time that was on the order of Web browsing speed. In typical Web browsing behavior, a human user clicks on a link and the Web browser opens several simultaneous connections to retrieve the information. Then, when the information is presented, the user may take a few seconds or minutes to digest it before locating the next desired link. We wanted r to be large enough to keep such a sequence of clicks together. A minimal sketch of this rule is given below.
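To make the rule concrete, here is a minimal Java sketch of threshold-based session identification: connections sorted by opening time are grouped into the same session until the idle gap exceeds r seconds. The Connection class and all field names are hypothetical, introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class ThresholdSessionizer {

    /** Hypothetical record of one TCP connection (times in seconds). */
    static class Connection {
        final double openTime;
        final double closeTime;
        Connection(double openTime, double closeTime) {
            this.openTime = openTime;
            this.closeTime = closeTime;
        }
    }

    /**
     * Groups connections, sorted by opening time, into sessions:
     * a new session starts when the user has had no open connection
     * for more than r consecutive seconds.
     */
    static List<List<Connection>> sessionize(List<Connection> conns, double r) {
        List<List<Connection>> sessions = new ArrayList<List<Connection>>();
        List<Connection> current = new ArrayList<Connection>();
        double lastActivity = Double.NEGATIVE_INFINITY;
        for (Connection c : conns) {
            // Idle gap = time between this opening and the latest close so far.
            if (!current.isEmpty() && c.openTime - lastActivity > r) {
                sessions.add(current);                 // idle gap exceeded r
                current = new ArrayList<Connection>();
            }
            current.add(c);
            lastActivity = Math.max(lastActivity, c.closeTime);
        }
        if (!current.isEmpty()) sessions.add(current);
        return sessions;
    }
}
```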

2. SYSTEM ANALYSIS
2.1 Existing System
In the literature, some models of the session arrival process have been presented. However, the focus is on Telnet and FTP sessions, where each session is related to a single TCP data connection. No measurements of HTTP sessions are reported.
To identify HTTP user-sessions, traditional approaches rely on the adoption of a threshold: TCP connections are aggregated into the same session if the inter-arrival time between two TCP connections is smaller than the threshold value. Otherwise, the TCP connection is associated with a newly created user-session.
The threshold-based approach works well only if the threshold value is correctly matched to the values of the connection and session inter-arrival times. Furthermore, different users may show different idle times, and even the same user may have different idle periods depending on the service he is accessing, e.g., news or e-commerce. Thus, a priori knowledge of the proper threshold value is an unrealistic assumption. If the threshold value is not correctly matched to the session's statistical behavior, threshold-based mechanisms are significantly error prone. To avoid this drawback, we propose a more robust algorithm to identify user-sessions.
Many existing systems perform a data analysis of server logs to define user-sessions. While the server log approach can be very reliable, it lacks the capability offered by passive measurements performed at the packet level, which permit simultaneously monitoring a user browsing several servers.
Some models adopt a passive sniffing methodology to rebuild HTTP-layer transactions and infer client/user behavior. By crawling HTTP protocol headers, the sequence of objects referred to by the initial request is rebuilt. This allows grouping several TCP connections to form a user-session. While this approach can be very effective, it does not scale well and, by leveraging a specific application-level protocol, can hardly be generalized. Furthermore, since the payload of all packets must be analyzed, this approach is not practical when, for security or privacy reasons, data payloads (and application-layer headers) are not available.

2.2 Proposed System

The main goals of this paper are:
i) to devise a technique that permits correctly identifying user-sessions, and
ii) to determine their statistical properties by analyzing traces of measured data.

The aim of this proposed system is to define a clustering technique to identify user-sessions. Its performance is compared with that of traditional threshold-based approaches, which partition samples depending on a comparison between the sample-to-sample distance and a given threshold value.
The main advantage of the clustering approach is that it avoids the need to define any threshold value a priori to separate and group samples. Thus, this methodology is more robust than simpler threshold-based mechanisms.
By running a clustering algorithm, we avoid the need to set a threshold value a priori, since clustering techniques automatically adapt to the actual user behavior. Furthermore, the algorithm does not require any training phase to run properly. We test the proposed methodology on artificially generated traces:

i) to ensure its ability to correctly identify a set of TCP connections belonging to the same user-session,
ii) to assess the error performance of the proposed technique, and
iii) to compare it with traditional threshold-based mechanisms.

Finally, we run the algorithms over real traffic traces, to obtain statistical information on user-sessions, such as the distributions of:
i) session duration,
ii) amount of data transferred in a single session,
iii) number of connections within a single session.

In this paper, only TCP headers are analyzed, limiting privacy issues and significantly reducing the probe complexity. Furthermore, the proposed approach may be adopted for any set of sessions generated by the same application.
1) The Hierarchical Agglomerative Approach:
Each sample is initially associated with a different cluster, so at procedure startup the number of clusters equals the number of samples. Then, on the basis of the definition of a cluster-to-cluster distance, the clusters at minimum distance are merged to form a new cluster. The algorithm iterates this step until all samples belong to the same cluster.
This approach can be quite time consuming, especially when the data set is very large, since the initial number of clusters is equal to the number of samples in the data set. For this reason, non-hierarchical approaches, named partitioned approaches, are often preferred, since they show better scalability properties.
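To make the agglomerative procedure concrete, here is a minimal Java sketch of single-linkage agglomerative clustering on one-dimensional samples such as TCP connection opening times. The stopping rule (a fixed cut level on the minimum cluster distance) and all names are illustrative assumptions, not the paper's exact algorithm; in the proposed system, the cut point would be derived from the data itself rather than fixed a priori.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class AgglomerativeSessions {

    /**
     * Single-linkage agglomerative clustering of 1-D samples (e.g.,
     * connection opening times, in seconds). Starts with one cluster
     * per sample and repeatedly merges the two closest clusters,
     * stopping when the minimum distance exceeds cutLevel.
     */
    static List<List<Double>> cluster(List<Double> samples, double cutLevel) {
        List<Double> sorted = new ArrayList<Double>(samples);
        Collections.sort(sorted);
        List<List<Double>> clusters = new ArrayList<List<Double>>();
        for (Double s : sorted) {
            List<Double> c = new ArrayList<Double>();
            c.add(s);
            clusters.add(c);   // initially, one cluster per sample
        }
        while (clusters.size() > 1) {
            // For sorted 1-D data the closest clusters are adjacent;
            // single-linkage distance = gap between last and first element.
            int best = 0;
            double bestDist = Double.POSITIVE_INFINITY;
            for (int i = 0; i + 1 < clusters.size(); i++) {
                List<Double> a = clusters.get(i);
                List<Double> b = clusters.get(i + 1);
                double d = b.get(0) - a.get(a.size() - 1);
                if (d < bestDist) { bestDist = d; best = i; }
            }
            if (bestDist > cutLevel) break;   // clusters are well separated
            clusters.get(best).addAll(clusters.remove(best + 1));
        }
        return clusters;   // surviving clusters ~ user-sessions
    }
}
```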

2) The Partitioned Approach:

This technique is used when the final number of clusters is known in advance. The procedure starts with an initial configuration of that many clusters, selected according to some criterion. The final cluster definition is then obtained through an iterative procedure.
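The best-known partitioned algorithm is k-means. The following one-dimensional Java sketch illustrates the iterative procedure (a random initial configuration, then alternating assignment and centroid-update steps); it is a generic illustration under the assumption that k is known, not the specific procedure used in this system.

```java
import java.util.Arrays;
import java.util.Random;

public class KMeans1D {

    /** Minimal 1-D k-means: returns the final cluster centroids. */
    static double[] kMeans(double[] samples, int k, int maxIter) {
        Random rnd = new Random(42);
        double[] centroids = new double[k];
        for (int j = 0; j < k; j++) {          // random initial configuration
            centroids[j] = samples[rnd.nextInt(samples.length)];
        }
        int[] assign = new int[samples.length];
        for (int iter = 0; iter < maxIter; iter++) {
            // Assignment step: each sample joins its nearest centroid.
            for (int i = 0; i < samples.length; i++) {
                int best = 0;
                for (int j = 1; j < k; j++) {
                    if (Math.abs(samples[i] - centroids[j])
                            < Math.abs(samples[i] - centroids[best])) best = j;
                }
                assign[i] = best;
            }
            // Update step: move each centroid to the mean of its members.
            double[] sum = new double[k];
            int[] count = new int[k];
            for (int i = 0; i < samples.length; i++) {
                sum[assign[i]] += samples[i];
                count[assign[i]]++;
            }
            for (int j = 0; j < k; j++) {
                if (count[j] > 0) centroids[j] = sum[j] / count[j];
            }
        }
        Arrays.sort(centroids);
        return centroids;
    }
}
```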
For each TCP connection, the following information is recorded:
1) The 4-tuple identifying the connection, i.e., the IP addresses and TCP port numbers of the client and the server;
2) The connection opening time, identified by the timestamp of the first client SYN message;
3) The connection ending time, identified by the time instant at which the TCP connection is terminated;
4) The net amount of bytes sent by the client and the server respectively (excluding retransmissions).
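A hypothetical Java container for this per-connection record could look as follows; the field names are assumptions introduced for illustration and are reused in later sketches.

```java
/** Per-connection measurement record, as listed above (illustrative). */
public class TcpConnectionRecord {
    // 1) The 4-tuple identifying the connection.
    String clientIp;
    int    clientPort;
    String serverIp;
    int    serverPort;
    // 2) Opening time: timestamp of the first client SYN (seconds).
    double openTime;
    // 3) Ending time: instant at which the connection is terminated.
    double closeTime;
    // 4) Net bytes sent by client and server, excluding retransmissions.
    long   clientBytes;
    long   serverBytes;
}
```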

2.3 Modules
i) Authentication
ii) Data set formulation
iii) Session clustering
Module 1: Authentication:
This module provides security assurance for the website the users are browsing. It is essential, since the sites in such a system maintain huge amounts of data that must be kept secure. These sites are used by a huge number of people, and any illegal activity on a site would affect all of these members. Security can therefore be improved by verifying some proof of identity, such as a student ID.
Module 2: Dataset formulation:
Once the user starts browsing, a session for the user is created automatically. If the user is valid, further TCP connections from the user are forwarded to the server. A user can have any number of connections in a single session, and can also have any number of sessions at a time. All the details of the session arrival and TCP connection arrival processes are stored in a log file or dataset on the server.
Module 3: Session clustering:
The TCP flows in a session are clustered based on the user behavior recorded in the data set. The session's active (ON) time can thus last as long as the user needs, and sessions are not prematurely marked inactive, thanks to the clustering of the same user's connections.
Data Fusion:
At the beginning of the data preprocessing, we have the Log containing the Web server log files collected by several Web servers, as well as the Web site maps (in XGML format [PK01], which is an XML application). First, we join the log files and then anonymise the resulting log file for privacy reasons.

Data cleaning:
A variety of files are accessed as a result of a request by a client to view a particular Web page. These include image, sound and video files, executable CGI files, coordinates of clickable regions in image map files, and HTML files. Thus the server logs contain many entries that are redundant or irrelevant for the data mining tasks. For example:
User request: Page1.html
Browser requests: Page1.html, a.gif, b.gif
The server log thus holds several entries for the same user request, hence the redundancy.
A second cleaning step is straightforward, consisting of the removal of all requests issued by (Host, User Agent) pairs identified as being Web robots. A sketch of both filters is given below.
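The following hypothetical Java filter drops embedded-object requests (recognized by file extension) and requests from known robots; the extension list and the robot-pair encoding are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class LogCleaner {

    // Extensions treated as embedded objects, not user page requests (assumed list).
    private static final Set<String> EMBEDDED = new HashSet<String>(
            Arrays.asList(".gif", ".jpg", ".png", ".css", ".js", ".ico"));

    /** True if the requested URL is an embedded component such as an image. */
    static boolean isEmbeddedObject(String url) {
        int dot = url.lastIndexOf('.');
        return dot >= 0 && EMBEDDED.contains(url.substring(dot).toLowerCase());
    }

    /** True if the (host, userAgent) pair was identified as a Web robot. */
    static boolean isRobot(String host, String userAgent,
                           Set<String> knownRobotPairs) {
        return knownRobotPairs.contains(host + "|" + userAgent);
    }
}
```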
Data Summarization:
The last step of data preprocessing contains what we call the "advanced preprocessing step for WUM". In this step, we first transfer the structured file containing visits or episodes (if identified) to a relational database. Afterwards, we apply data generalization at the request level (for URLs) and compute the aggregated data for episodes, visits and user sessions, to completely fill in the database.
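As an illustration of the aggregated data computed per user-session (the duration, data-volume and connection-count statistics listed in Section 2.2), here is a minimal sketch assuming the hypothetical TcpConnectionRecord container shown earlier (same package assumed).

```java
import java.util.List;

public class SessionSummary {
    double duration;        // session duration, seconds
    long   totalBytes;      // data transferred within the session
    int    numConnections;  // connections within the session

    /** Aggregates one clustered session into summary statistics. */
    static SessionSummary summarize(List<TcpConnectionRecord> session) {
        SessionSummary s = new SessionSummary();
        double first = Double.POSITIVE_INFINITY, last = 0;
        for (TcpConnectionRecord c : session) {
            first = Math.min(first, c.openTime);
            last  = Math.max(last, c.closeTime);
            s.totalBytes += c.clientBytes + c.serverBytes;
        }
        s.numConnections = session.size();
        s.duration = last - first;
        return s;
    }
}
```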

2.4 Software Specifications

Java Tools:
Language: Java JDK 1.6
Java Technologies: Swing, Servlets, JSP
IDE: NetBeans 6.0
Operating System: Windows 2000/XP

3. SOFTWARE SYSTEM ATTRIBUTES


Reliability
The system shall fail only under unavoidable circumstances, such as network connection failure.
Availability
The system shall allow users to restart the application after a failure, with the loss of the old Peer ID along with its trust data, and shall provide a new Peer ID by treating the restarted peer as a new peer.
Security
The factors that protect the software from accidental or malicious access, use, modification, destruction, or disclosure are:

The software can be accessed only by authorized users, since they are provided with a user name & password.

The trust value about one peer is encrypted using public key cryptography and then transferred to the other peer.

Maintainability
This specifies the attributes of the software that relate to the ease of maintenance of the software itself; there may be requirements for a certain modularity, certain interfaces, a bounded complexity, etc. Requirements should not be placed here just because they are thought to be good design practices, but only if, for example, someone else will maintain the system.
Portability
Since the software is developed using host-independent code, it can be executed on different hosts with different platforms.

4. IMPLEMENTATION
4.1 Java
Java was conceived by James Gosling, Patrick Naughton, Chris Warth, Ed Frank and Mike Sheridan at Sun Microsystems Incorporated in 1991. It took 18 months to develop the first working version. The language was initially called "Oak", but was renamed "Java" in 1995. Between the initial implementation of Oak in 1992 and the public announcement of Java in 1995, many more people contributed to the design and evolution of the language.
Java overview


Java is a powerful but lean object-oriented programming language. It has generated a lot of excitement because it makes it possible to program for the Internet by creating applets: programs that can be embedded in a web page. The content of an applet is limited only by one's imagination. For example, an applet can be an animation with sound, an interactive game, or a ticker tape with constantly updated stock prices. Applets can be just little decorations to liven up a web page, or they can be serious applications like word processors or spreadsheets.
But Java is more than a programming language for writing applets. It is being used more and more for writing standalone applications as well, and it is becoming so popular that many people believe it will become the standard language for both general-purpose and Internet programming.
Java builds on the strengths of C++: it has taken the best features of C++ and added garbage collection, multithreading and security capabilities. The result is a language that is both powerful and easy to use.
Java is actually a platform consisting of three components:
1. Java programming language
2. Java library of classes and interfaces
3. Java virtual machine

Java is portable
One of the biggest advantages Java offers is that it is portable. An application written in Java will run on all the major platforms: any computer with a Java-based browser can run the applications or applets written in the Java programming language. A programmer no longer has to write one program to run on a Windows machine, another to run on a UNIX machine, and so on. Developers write code once; the Java code is compiled into byte codes rather than into a machine language. These byte codes go to the Java Virtual Machine, which executes them or translates them into the language that is understood by the underlying system.
Java is object oriented


Classes
A class is a blueprint for items of a similar type: the combination of both the data and the code of an object can be made into a user-defined data type with the help of a class. A class defines the shape and behavior of an object and its data. Once a class has been defined, we can create any number of objects belonging to that class. As said already, classes are user-defined data types and behave like the built-in types of a programming language.
Data Abstraction
Data abstraction is the act of representing essential features without including the background details and explanations.
Encapsulation
Data encapsulation is one of the most striking features of Java. Encapsulation is the wrapping up of data and functions into a single unit called a class. The wrapper defines the behavior and protects the code and data from being arbitrarily accessed by the outside world: only those functions which are wrapped in the class can access them. This insulation of data from direct access by the program is called 'data hiding'.

Inheritance
Inheritance is the process by which objects of one class can acquire the properties of objects of another class. In Java, the concept of inheritance provides the idea of reusability, supplying the means of adding additional features to an existing class without modifying it. This is achieved by deriving a new class from the existing one; the newly created class will have the combined features of both the parent and the child classes.
Polymorphism

Polymorphism means the ability to take more than one form, i.e., one object, many shapes. Polymorphism plays an important role in allowing objects having different internal structures to share the same external interface. This means that a general class of operations may be accessed in the same manner even though the specific actions associated with each operation may differ.
Dynamic Binding
Binding refers to the linking of a procedure call to the code to be executed in response to the call. Dynamic binding means that the code associated with a given procedure call is not known until the time of the call, at run time. Dynamic binding is closely associated with polymorphism: which implementation a polymorphic reference invokes depends on the dynamic type of that reference.
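A minimal example showing inheritance, polymorphism and dynamic binding together (class names are illustrative):

```java
// Parent class defining the shared external interface.
class Shape {
    double area() { return 0; }
}

// Child classes inherit from Shape and override area().
class Circle extends Shape {
    double r;
    Circle(double r) { this.r = r; }
    @Override double area() { return Math.PI * r * r; }
}

class Square extends Shape {
    double side;
    Square(double side) { this.side = side; }
    @Override double area() { return side * side; }
}

public class PolymorphismDemo {
    public static void main(String[] args) {
        // Same external interface, different internal structure:
        Shape[] shapes = { new Circle(1.0), new Square(2.0) };
        for (Shape s : shapes) {
            // Dynamic binding: which area() runs depends on the
            // run-time type of s, not its declared type.
            System.out.println(s.area());
        }
    }
}
```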
Java programming structure
A Java source file is a text file that contains one or more class definitions. The Java compiler expects these files to be stored with the '.java' filename extension. When Java source code is compiled, each individual class is put into its own output file, named after the class, with a '.class' extension. Since there are no global functions or variables in Java, the only thing a Java source file can contain is one or more class definitions: Java requires that all code reside inside a named class. Java is highly case sensitive with respect to all keywords and identifiers. In Java, the code of any method must be started by an open brace and ended by a close brace. Every Java application must have a 'main' method; the main method is simply the place where the interpreter begins. Java applets do not use a main method at all, since the web browser's Java runtime has a different convention for bootstrapping applets. In Java, every statement must end with a semicolon, and there are no limits on the length of statements. Java is a free-form language: programs can be written in any layout, provided there is at least one space between each token. Java programs are collections of white space, comments, keywords, identifiers, literals, operators and separators.


Packages and interfaces
Java allows classes to be grouped in a collection called a package. Packages are a convenient way of organizing classes and libraries, and they can be nested: a number of classes having the same kind of behavior can be grouped under one package. Packages are imported into the Java programs that require them using the import keyword.
Interfaces provide a mechanism that allows unrelated classes to implement the same set of methods. An interface is a collection of method prototypes and constant values that are free from dependency on a specific class. Interfaces are implemented by using the implements keyword.
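For instance, a small illustrative example (all names hypothetical):

```java
import java.util.ArrayList;   // classes imported from the java.util package
import java.util.List;

// An interface: method prototypes and constant values, free of any class.
interface Playable {
    int MAX_VOLUME = 10;   // constant value
    void play();           // method prototype
}

// Two unrelated classes implementing the same set of methods.
class AudioClip implements Playable {
    public void play() { System.out.println("playing audio clip"); }
}

class Animation implements Playable {
    public void play() { System.out.println("playing animation"); }
}

public class InterfaceDemo {
    public static void main(String[] args) {
        List<Playable> items = new ArrayList<Playable>();
        items.add(new AudioClip());
        items.add(new Animation());
        for (Playable p : items) p.play();
    }
}
```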
4.2 Introduction to API
Application programming interfaces (APIs) form the heart of any Java program. These APIs are defined in the corresponding Java packages and are imported into the program. Some of the packages available in Java are:
* java.lang - includes all language libraries
* java.awt - includes AWT libraries, such as windows, scrollbars, etc., for GUI applications
* java.applet - includes the API for applet programming
* java.io - includes all libraries required for input-output applications
* java.awt.image - includes libraries for image processing
* java.net - includes networking APIs
* java.util - includes general APIs like Vector, Stack, etc.
* javax.swing - includes the Swing libraries for windows and GUI applications

4.3 Java Database Connectivity (JDBC)

The Java Database Connectivity (JDBC) API is a standard Java extension for data access that allows Java programmers to code to a unified relational database API. By using JDBC, a Java programmer can represent database connections, issue SQL statements, and process database results. JDBC is implemented by a JDBC driver, an adapter that knows how to talk to a particular database in a proprietary way. JDBC is similar to the Open Database Connectivity (ODBC) standard, and the two are quite interoperable through JDBC-ODBC bridges.
14

JDBC
JDBC is a set of interfaces for connecting to SQL tables. With JDBC, we can query and update the data stored in SQL tables from within our Java programs; in this way, any Java object can be saved into SQL tables. This Java API is essential for EJB, the core API in J2EE.
JDBC creates a programming-level interface for communicating with databases in a uniform manner, similar in concept to Microsoft's Open Database Connectivity (ODBC) component, which has become the standard for personal computers and LANs. The JDBC standard itself is based on the X/Open SQL Call Level Interface, the same basis as that of ODBC. This is one of the reasons why the initial development of JDBC progressed so fast.
Object classes for opening transactions with these databases are written completely in Java, to allow much closer interaction than you would get by embedding C language function calls in Java programs, as you would have to do with ODBC. This way we can still maintain the security, the robustness, and the portability that make Java so exciting.
However, to promote its use and maintain some level of backward compatibility, JDBC can be implemented on top of ODBC and other common SQL APIs from vendors.
JDBC consists of two main layers: the JDBC API supports application-to-JDBC Manager communications; the JDBC Driver API supports JDBC Manager-to-driver implementation communications.
In terms of Java classes, the JDBC API consists of:
java.sql.DriverManager - manages drivers and allows the creation of new database connections;
java.sql.Connection - connection-specific data structures;
java.sql.Statement - container class for embedded SQL statements;
java.sql.ResultSet - access control to the results of a statement.
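A minimal usage sketch of these classes follows; the JDBC URL, table name and credentials are hypothetical, introduced only for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class JdbcDemo {
    public static void main(String[] args) throws SQLException {
        // With pre-JDBC-4 drivers, the driver class must be loaded first, e.g.:
        // Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
        // DriverManager creates the connection from a driver-specific URL.
        Connection con = DriverManager.getConnection(
                "jdbc:odbc:sessionsDB", "user", "password");
        try {
            Statement st = con.createStatement();
            // ResultSet gives access to the results of the statement.
            ResultSet rs = st.executeQuery(
                    "SELECT session_id, duration FROM sessions");
            while (rs.next()) {
                System.out.println(rs.getInt("session_id")
                        + " lasted " + rs.getDouble("duration") + " s");
            }
            rs.close();
            st.close();
        } finally {
            con.close();
        }
    }
}
```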

5. UML DIAGRAMS

Use case diagram:


Class Diagram:


Sequence Diagram:


6. TESTING
Introduction:
Testing is a scheduled process carried out by the software development team to capture all possible errors and missing operations, and to perform a complete verification that the objectives are met and the user requirements are satisfied. The design of tests for software and other engineering products can be as challenging as the initial design of the product itself.
Testing Types:
A software engineering product can be tested in one of two ways:
Black box testing
White box testing
Black box testing:
Knowing the specified functions that a product has been designed to perform, tests determine whether each function is fully operational.
White box testing:
Knowing the internal workings of a software product, tests determine whether the internal operations implementing the functions perform according to the specification, and whether all the internal components have been adequately exercised.
Testing Strategies:
Four Testing Strategies that are often adopted by the software development team include:
Unit Testing
Integration Testing
Validation Testing
System Testing
This system was tested using Unit Testing and Integration Testing, because these were the most relevant approaches for this project.


Strategies to test the project

Unit Testing:
We adopted white box testing when using this technique. The testing was carried out on the individual components of the software as they were designed; each individual module was tested with this technique during the coding phase. Every component was checked to make sure that it adhered strictly to the specifications spelt out in the data flow diagram and performed its intended purpose.
All the names of the variables were scrutinized to make sure that they truly reflect the elements they represent. All the looping mechanisms were verified to ensure that they behaved as designed. Besides this, we traced through the code manually to capture syntax and logical errors.
Integration Testing:
After finishing the unit testing process, the next step is integration testing. In this process we focused on identifying the interfaces between components and their functionality as dictated by the DFD diagram. The bottom-up incremental approach was adopted: low-level modules are integrated and combined as a cluster before testing.
The black box testing technique was employed here, and the interfaces between the components were tested first. This allows any wrong linkages or parameter passing to be identified early in the development process, since a set of data can simply be passed in and the returned result checked for acceptance.
Validation Testing:
Software validation is achieved through a series of black box tests that demonstrate conformity with requirements. A test procedure defines the specific test cases that will be used to demonstrate this conformity. Both the plan and the procedure are designed to ensure that all functional requirements are achieved, the documentation is correct, and the other requirements are met. After each validation test case has been conducted, one of two possible conditions exists:
1) the function or performance characteristics conform to specification and are accepted, or
2) a deviation from specification is uncovered and a deficiency list is created.
A deviation or error discovered at this stage of a project can rarely be corrected prior to the scheduled completion; it is then necessary to negotiate with the customer to establish a method for resolving the deficiencies.


System Testing:
System testing is a series of different tests whose primary purpose is to fully exercise the computer-based system. Although each test has a different purpose, all of them should verify that all system elements have been properly integrated and perform their allocated functions.
System testing also ensures that the project works well in its environment. It traps errors and allows convenient processing of errors without exiting the program abruptly. Recovery testing is done in such a way that a failure is forced on the software system and it is checked whether the recovery is proper and accurate. The performance of the system proved highly effective.
Software testing is a critical element of software quality assurance and represents the ultimate review of specification, design and coding. Test case design focuses on a set of techniques for the creation of test cases that meet the overall testing objectives. Planning and testing of a programming system involve formulating a set of test cases, which are similar to the real data that the system is intended to manipulate. Test cases consist of input specifications, a description of the system functions exercised by the input, and a statement of the expected output. Thorough testing involves producing cases to ensure that the program responds as expected to both valid and invalid inputs, that the program performs to specification, and that it does not corrupt other programs or data in the system.
In principle, the testing of a program must be extensive: every statement in the program should be exercised and every possible path combination through the program should be executed at least once. Since this is infeasible for any non-trivial program, it is necessary to select a subset of the possible test cases and conjecture that this subset will adequately test the program.

Approach to testing:

Testing a system's capabilities is more important than testing its components. This means that test cases should be chosen to identify aspects of the system which would stop it from doing its job.
Testing old capabilities is more important than testing new capabilities: if a program is a revision of an existing system, users expect the existing features to keep working.
Testing typical situations is more important than testing boundary value cases. It is more important that a system works under normal usage conditions than under occasional conditions which only arise with extreme data values.
Test data:
Test data are the inputs which have been devised to test the system. It is sometimes possible to generate test data automatically. The data entered at run time by the user are compared against the data type specified in the code; if they do not match, an error is displayed in a message box and the user is not allowed to proceed further. Thus the data are validated.
Algorithm Used for Joining Log Files:
First, we join the different log files from the Log, putting the requests from all log files together into a joint log file. Generally, the requests in the log files do not include the name of the server. However, we need the Web server name to distinguish between requests made to different Web servers, so we add this information to each request (before the file path). Moreover, we have to take into account the synchronization of the Web server clocks, including the time zone differences. Figure 2.2 shows our algorithm for joining Web server log files; a hypothetical sketch of its main steps is given below.
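The following Java sketch illustrates the joining step under stated assumptions: each log line carries a leading epoch timestamp, each server has a known clock offset (covering time zone differences) added to its timestamps, the server name is prepended to each request line, and the merged log is sorted by corrected time. The line format and all names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class LogJoiner {

    /** One request line, tagged with server name and corrected timestamp. */
    static class Entry {
        long   timestamp;   // seconds, after clock-offset correction
        String line;        // server name + original request line
    }

    /**
     * Joins several Web server log files into one time-ordered log.
     * clockOffsets maps each server name to its clock correction in seconds.
     */
    static List<Entry> join(Map<String, String> serverToFile,
                            Map<String, Long> clockOffsets) throws IOException {
        List<Entry> joint = new ArrayList<Entry>();
        for (Map.Entry<String, String> e : serverToFile.entrySet()) {
            String server = e.getKey();
            long offset = clockOffsets.get(server);
            BufferedReader in = new BufferedReader(new FileReader(e.getValue()));
            String line;
            while ((line = in.readLine()) != null) {
                Entry entry = new Entry();
                entry.timestamp = parseTimestamp(line) + offset;
                entry.line = server + " " + line;   // server name before the path
                joint.add(entry);
            }
            in.close();
        }
        Collections.sort(joint, new Comparator<Entry>() {
            public int compare(Entry a, Entry b) {
                return a.timestamp < b.timestamp ? -1
                     : a.timestamp > b.timestamp ? 1 : 0;
            }
        });
        return joint;
    }

    // Placeholder: extract the request time from a log line (format-specific).
    static long parseTimestamp(String line) {
        return Long.parseLong(line.split(" ")[0]);   // assumes leading epoch time
    }
}
```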


7. SCREEN SHOTS

