
WEB-BASED DATA MINING IN ACADEMIC WEBSITES


Guide: Mr. D. George Washington


Name: Prasanna Kumar Palepu
Reg No: 200536314

Abstract:

This project examines applications of Web mining that help discover pedagogically relevant knowledge contained in databases obtained from Web-based educational systems. The findings can be used both to make effective use of resources and to minimize web traffic and intrusions. The mass of information on an education website is analyzed and reasoned about using Web mining technology, which can uncover latent patterns, reduce risk, and support the right decisions.

The intended goals are:

* To mine the web log and find drawbacks in web sites.
* To build an interface to analyze the web log.

Previous Status of The Project:


Worked on filtering the log file, storing the filtered entries in a database, and updating the database with each day's web log data.

Present Status of The Project:

* Designed the database structure for the log file.
* Collected an IP-to-country database.
* Collected a GMT-to-country database.
* Collected a USER_Agent database.
* Created the user interface design with UML diagrams.
* Created the report formats and table structures.
* Wrote code for parsing the log file; debugging is in progress.

Guide: Mr. D. George Washington 17-Oct-2008


Prasanna Kumar Palepu (200536314) Page 1

Architecture and Design:

Introduction:
This section describes the system design in terms of packages, classes, relationships, and behavior. Several attached worksheets address specific aspects of the overall system design, such as the user interface and database design.

The most important fact of the design:

This design is intended to help create a rich interface for web administrators to analyze web log data and find anomalies in websites.

UML Structural Design


The system's structural design is described in the UML model WebLogModelStructure, through the following UML structural diagrams:

* PACKAGE WeblogModelStructure OVERVIEW DIAGRAM
* WebLogModel
  o AddLog Diagram
  o ParseLog Diagram
  o ExportLog Diagram

UML Behavioral Design


The system's behavioral design is described in the UML model WebLogModelBehavioral, through the following UML diagrams:

* Referrer Statistics Class Diagram
* Access Statistics Class Diagram
* User Agent Statistics Class Diagram
* OuterView Of Project
* UML Activity Diagram

UML Design Checklist


Correctness: The design is complete and correct, and modifications to it will not lead to drastic changes in the entire system.

Feasibility: The time spent on design matches the Gantt chart, so the schedule is feasible.

Understandability: The diagrams were produced with the Describe UML tool, which is user-friendly, so the design is easy to understand.


Implementation phase guidance: The designed modules are easy to implement.

Modularity: No existing software is used for parsing the web log data; the parser is unique to this project. The design keeps all modules distinctly separated.

Extensibility: New code is easy to add to the system because it is written in VB.NET, which is user-friendly.

Testability: The system is easy to test with testing tools. Manual testing is also done for verification and validation, on each module individually and on the system as a whole.

Efficiency: The system consumes an acceptable amount of time, storage space, bandwidth, and other resources.

Architecture Overview
Software architecture style used:
Single web service: app-server, database.
Ranked goals of this architecture:

1. Ease of integration
2. Extensibility
3. Capacity matching

Components
The components of this system are listed below by type:

* Presentation/UI Components
  o C-00: WebLogUI
* Application Logic Components
  o C-10: WebLogLogic
* Data Storage Components
  o C-20: WebLogStorage

Deployment
The components are deployed as follows:

* All-in-one server
  o WebLog front end
    + C-00: WebLogUI
    + C-10: WebLogLogic
  o Database process
    + C-20: WebLogStorage

Aspects and resources of the environment are shared as follows:

Everything is on one Oracle server, so all machine resources are shared by all components.


The database will be updated constantly using the export function.

The database process could be moved to a separate machine with a fairly simple change to a configuration file, and more front-end servers can be added. Nothing else about the deployment can be changed: the application logic running on the application server cannot be split or load-balanced.

Integration

The components are integrated and communicate as follows:

Components within the same process use direct procedure calls. The database is accessed through an ODBC driver, and communication between the front-end and back-end servers also uses ODBC.

Architectural Scenarios
The following sequence diagrams give step-by-step descriptions of how
components communicate during some important usage scenarios:

* System startup
* System shutdown
* ParsingLog
* ExportingLog

Architecture Checklist
Ease of integration: The architecture uses the mechanisms provided for all needed types of integration, and all of the new components are designed to work together. The reused components are integrated via fairly simple interfaces.

Source Code Organization and Build System

Overview
The source code organization roughly follows the standard proposed in the Visual Studio .NET documentation.

Ranked goals of this source code organization and build system:

1. Separation of files by type
2. Separation of version-controlled files from files generated by the build process
3. Compatibility with standard build processes


Key Directories and Files in Working Copies


Path                                 Description
Logs/                                Web log file directory for parsing
src/                                 VB.NET source files
src/Model/                           VB.NET model form source file
src/Report/                          VB.NET report source file
src/VBNET/[Nested packages]/         VB.NET source code of classes in each package
src/VBNET/[Nested packages]/test/    VB.NET source code of unit tests for classes in each package
conf/                                Configuration files
data/                                Initial data to load into database and/or file system
lib/                                 Libraries reused by this project
build/                               Output of build process
help/                                Project documents

Build Targets
Target    Description
compile   Compiles the VB.NET source code and creates an executable file.
Load      Loads the intended log file into the application.
Parse     The main target of the application: parses the log file and stores the result in a temporary space.
Export    Exports the parsed data to the database and removes the temporary space used during parsing.
Analyze   Analyzes the exported data from the database.

Build Configuration Options


Property         Description
WebLogAnalysis   The tool being created to export the raw web log to the database for analysis.
1.0              Version number of this release.

User Interface

Overview
The ranked goals for the user interface of this system:

1. Understandability and learnability
2. Task support and efficiency
3. Safety
4. Consistency and familiarity

This UI design follows Microsoft UI guidelines.

Task Models
Only web administrators will use this software, to find drawbacks in a web site.

Technical Constraints / Operational Contextualization


Output devices:
The "WebLogAnalyzer" system has a 320x200, 16-color display as its model window.
Windowing systems, UI libraries, or other UI technologies used:
Standard .NET with no extra libraries.

User Interface Checklist


Understandability and learnability
The labels and icons used in this system are standard ones, so they are not open to misunderstanding.
The advanced options are clearly separated from the most commonly used options.
There are no invisible options or commands.
Safety
Export from the front end to the database is a one-way process, but it can still be rolled back through database administration.
Consistency and familiarity
The UI elements in this system work the same as they do in the existing example systems I identified, and all elements in this system that appear the same actually function the same.

Persistence

Central Database
Database access controls used:
A database user account has been created that has access to the needed application database tables. The username and password for this account are stored in a configuration file read by the application server.
Is this application's central database accessible to other applications?
No. This database should always be accessed through this application. All relevant pieces of information are available through the application interfaces. The database itself does not protect against data corruption that could be caused by other applications.

File Storage
The server stores its data in the database rather than in files. User documents are stored in files on the users' own hard disks.

Persistence Mechanisms Checklist


Expressiveness: The database is easy to understand.
Ease of access: The database is accessible only with a login ID and password.
Reliability: The database is highly reliable.
Capacity: The database server has more than 80 GB of free space.
Security: The database is highly secure.
Performance: The system runs faster on Intel-based machines with more than 512 MB of RAM.

Physical structure of the Database:-


All tables described below are deployed in Oracle and are normalized. Modifying the database will not have much impact on the overall design of the project.

Main_Parsed Table1:-
Field Name     Data Type    Length   Description
Unique_ID      AutoNumber   50       Unique number to identify the records.
Client_IP      VARCHAR2     50       This is the address of the computer making the HTTP request. The server records the IP
RFC_Name       VARCHAR2     20       The field is designed to identify the requestor. If this information is not recorded, a hyphen (-) holds the column in the log.
LogName        VARCHAR2     20       If using local authentication and registration, the user's log name will appear; likewise, if no value is present, a "-" is substituted.
Log_Date       TIMESTAMP             The format is DD/Mon/YYYY:HH:MM:SS +GMT
Req_method     VARCHAR2     20       Request method is GET, PUT, POST, or HEAD
Req_Path       VARCHAR2     256      Path is the path and file retrieved
Req_Protocol   VARCHAR2     20       It defines the protocol used by the client
Stat_Code      VARCHAR2     3        HTTP completion code. 200: OK; 3xx: some sort of redirection; 4xx: some sort of client error; 5xx: some sort of server error
Req_Bytes      VARCHAR2     10       For GET HTTP transactions, this field is the number of bytes transferred. For other commands this field will be a hyphen (-) or a zero (0)
Referrer       VARCHAR2     50       The referrer URL indicates the page where the visitor was located when making the next request.
User_agent     VARCHAR2     200      The user agent is information about the browser, version, and operating system of the reader. The general format is:

GMT Table2:-
Field Name      Data Type   Length   Description
GMT             SMALLINT    5        Greenwich Mean Time in number format
Zone            VARCHAR2    2        Zone of the GMT
Military_Code   VARCHAR2    10       Military code for the time zone
Country         VARCHAR2    15       Country name
City            VARCHAR2    15       City name

IP2Country Table3:-
Field Name     Data Type   Length   Description
IP_From        NUMBER      12       Starting IP address (numerical representation of IP address)
IP_To          NUMBER      12       Ending IP address (numerical representation of IP address)
Registry       VARCHAR2    10       The registry holding the reserved address numbers. It contains "apnic, arin, lacnic, ripencc, afrinic"
Country_Code   VARCHAR2    3        Code of the country
Country        VARCHAR2    20       Full description of the country

IP example (the octets are weighted from right to left):

1.2.3.4 = 4 + (3 * 256) + (2 * 256 * 256) + (1 * 256 * 256 * 256) = 16909060
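
The conversion above, and a lookup against the IP_From/IP_To ranges of the IP2Country table, could be sketched in C as follows. This is only an illustration under assumptions: the names ip_to_number, struct ip_range, and find_country are hypothetical, and the ranges are assumed to be sorted by IP_From and non-overlapping so that a binary search applies.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Converts a dotted-quad IPv4 string such as "1.2.3.4" to its numeric
 * form. Returns 1 on success and 0 if the text is not a valid address. */
int ip_to_number(const char *text, uint32_t *out)
{
    unsigned int a, b, c, d;
    char trailing;

    /* The extra %c detects junk after the fourth octet. */
    if (sscanf(text, "%u.%u.%u.%u%c", &a, &b, &c, &d, &trailing) != 4)
        return 0;
    if (a > 255 || b > 255 || c > 255 || d > 255)
        return 0;
    *out = ((uint32_t)a << 24) | ((uint32_t)b << 16) |
           ((uint32_t)c << 8) | (uint32_t)d;
    return 1;
}

/* One row of the IP2Country table (hypothetical in-memory form). */
struct ip_range {
    uint32_t ip_from;
    uint32_t ip_to;
    const char *country_code;
};

/* Binary search over ranges sorted by ip_from (an assumption).
 * Returns the country code, or NULL when no range contains ip. */
const char *find_country(const struct ip_range *ranges, size_t count,
                         uint32_t ip)
{
    size_t lo = 0, hi = count;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (ip < ranges[mid].ip_from)
            hi = mid;
        else if (ip > ranges[mid].ip_to)
            lo = mid + 1;
        else
            return ranges[mid].country_code;
    }
    return NULL;
}
```

Calling ip_to_number("1.2.3.4", &n) stores 16909060 in n, matching the worked example. In the real system the range lookup would more likely be a database query; the in-memory search only shows the comparison logic.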

User_agent Table4:-
Field Name       Data Type   Length   Description
U_Agent_String   VARCHAR2    100      User agent string with all information about the client system.
U_Agent_Type     VARCHAR2    2        S-Spiders, R-Robots, C-Crawler, B-Browser
Browser          VARCHAR2    10       Browser version
Platform         VARCHAR2    10       Platform of user
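
A rough way to fill U_Agent_Type is a keyword match on the user agent string. The sketch below is an assumption for illustration, not the project's method: the design relies on a collected USER_Agent database, whereas this code uses hypothetical substring rules ("spider", "crawl", "bot", "slurp"), and the function names are made up.

```c
#include <ctype.h>
#include <stddef.h>

/* Returns 1 if word occurs in text, ignoring letter case. */
static int contains_ignore_case(const char *text, const char *word)
{
    size_t i, j;

    for (i = 0; text[i] != '\0'; ++i) {
        for (j = 0; word[j] != '\0'; ++j) {
            /* A mismatch (including reaching the end of text) stops
             * this attempt before reading past the terminator. */
            if (tolower((unsigned char)text[i + j]) !=
                tolower((unsigned char)word[j]))
                break;
        }
        if (word[j] == '\0')
            return 1;
    }
    return 0;
}

/* Maps a user agent string to the U_Agent_Type codes of the table:
 * 'S' spider, 'C' crawler, 'R' robot, 'B' browser (the fallback).
 * The keyword rules are illustrative assumptions only. */
char classify_user_agent(const char *user_agent)
{
    if (contains_ignore_case(user_agent, "spider"))
        return 'S';
    if (contains_ignore_case(user_agent, "crawl"))
        return 'C';
    if (contains_ignore_case(user_agent, "bot") ||
        contains_ignore_case(user_agent, "slurp"))
        return 'R';
    return 'B';
}
```

With these rules, the "msnbot/1.0" and "Yahoo! Slurp" agents from the sample log would be classed as 'R', and the MSIE 6.0 agent as 'B'. A lookup table loaded from the USER_Agent database would be more reliable than fixed keywords.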

Req_Resourse Table5:-
Field Name   Data Type   Length   Description
Req_URL      VARCHAR2    100      Requested URL path
Req_File     VARCHAR2    50       Requested file
Req_Bytes    NUMBER      10       Requested file size in bytes

Status_Code Table6:-
Field Name    Data Type   Length   Description
Stat_Code     NUMBER      3        HTTP completion code.
Stat_C_Desc   VARCHAR2    25       200: OK; 3xx: some sort of redirection; 4xx: some sort of client error; 5xx: some sort of server error
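
The grouping used for Stat_C_Desc can be expressed as a small function. This is a minimal sketch; the name status_class and the "Unclassified" fallback for codes outside the listed groups are assumptions made for the example.

```c
/* Returns the description group for an HTTP completion code, following
 * the groups listed for Stat_C_Desc. "Unclassified" is a hypothetical
 * fallback for codes the table does not describe. */
const char *status_class(int code)
{
    if (code == 200)
        return "OK";
    switch (code / 100) {
    case 3:
        return "Some sort of Redirection";
    case 4:
        return "Some sort of Client Error";
    case 5:
        return "Some sort of Server Error";
    default:
        return "Unclassified";
    }
}
```

Every returned string fits within the 25-character length given for Stat_C_Desc.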

Host_Summary Table7:-
Field Name         Data Type   Length   Description
Client_IP          VARCHAR2    50       This is the address of the computer making the HTTP request. The server records the IP
Country_Code       VARCHAR2    3        Code of the country
No_Of_Occurances   NUMBER      5        The number of times the client visited the website.
No_Of_Pages        NUMBER      5        The number of times the client visited the webpages.
Bandwidth          NUMBER      10       Bandwidth in bytes
Date               DATETIME             Date the client visited the website.

Referrar_Code Table8:-
Field Name      Data Type   Length   Description
Ref_URL         VARCHAR2    100      Referral URL
Ref_Site        VARCHAR2    100      Referring website
Key_Word1       VARCHAR2    20       Keywords used to search the content in website
Key_Word2       VARCHAR2    20       Keywords used to search the content in website
Key_Word3       VARCHAR2    20       Keywords used to search the content in website
Key_Word4       VARCHAR2    20       Keywords used to search the content in website
Key_Word5       VARCHAR2    20       Keywords used to search the content in website
Search_Engine   VARCHAR2    20       Name of the search engine
Dom_Name        VARCHAR2    5        Name of the domain
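
One possible way to fill Key_Word1 to Key_Word5 is to read the search terms from the query string of the referrer URL. The sketch below assumes the search engine passes the terms in a named parameter (for example "q") with "+" between words; the parameter name, the function names, and the sample URL in the note are assumptions, and percent-encoded characters are not decoded.

```c
#include <stddef.h>
#include <string.h>

#define MAX_KEYWORDS 5   /* Key_Word1 .. Key_Word5 */
#define KEYWORD_LEN  20  /* column length in the table */

/* Finds query parameter `name` in `url`. Returns a pointer to the first
 * character of its value, or NULL when the parameter is absent. */
static const char *find_param(const char *url, const char *name)
{
    size_t name_len = strlen(name);
    const char *p = strchr(url, '?');

    while (p != NULL) {
        ++p; /* skip the '?' or '&' separator */
        if (strncmp(p, name, name_len) == 0 && p[name_len] == '=')
            return p + name_len + 1;
        p = strchr(p, '&');
    }
    return NULL;
}

/* Splits the parameter value on '+' into at most MAX_KEYWORDS words,
 * truncating each to KEYWORD_LEN characters. Returns the word count. */
int extract_keywords(const char *url, const char *param,
                     char keywords[MAX_KEYWORDS][KEYWORD_LEN + 1])
{
    const char *p = find_param(url, param);
    int count = 0;

    if (p == NULL)
        return 0;
    while (*p != '\0' && *p != '&' && count < MAX_KEYWORDS) {
        size_t len = 0;

        /* Copy one word, stopping at '+' or the end of the value. */
        while (*p != '\0' && *p != '&' && *p != '+') {
            if (len < KEYWORD_LEN)
                keywords[count][len++] = *p;
            ++p;
        }
        keywords[count][len] = '\0';
        if (len > 0)
            ++count;
        if (*p == '+')
            ++p;
    }
    return count;
}
```

For a hypothetical referrer such as http://www.google.co.in/search?q=anna+university, calling extract_keywords with param "q" would yield the two words "anna" and "university". Other engines may use a different parameter name, so the name is passed in rather than fixed.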


UML Activity Diagram

[Figure: UML activity diagram]

Parse Log Data, followed by these activities:

* Parsing time zone by splitting the date time and GMT field
* Finding country by IP address
* Parsing the arguments in request
* Parsing referrer details
* Parsing user agent
* Parsing status code

Update in database
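
The first activity, splitting the date and time from the GMT field, might look like the following C sketch. It assumes the timestamp has already been taken out of its square brackets and has the form DD/Mon/YYYY:HH:MM:SS +HHMM; struct log_time and parse_log_time are hypothetical names, and expressing the offset in minutes is a choice made for the example.

```c
#include <stdio.h>

/* Parsed form of one log timestamp (hypothetical structure). */
struct log_time {
    int day;
    char month[4];          /* three-letter month name, e.g. "Sep" */
    int year;
    int hour;
    int minute;
    int second;
    int gmt_offset_minutes; /* +0530 becomes 330, -0600 becomes -360 */
};

/* Parses text such as "09/Sep/2007:04:13:04 +0530".
 * Returns 1 on success and 0 if the text does not match the format. */
int parse_log_time(const char *text, struct log_time *out)
{
    char sign;
    int zone_hours, zone_minutes;

    if (sscanf(text, "%2d/%3s/%4d:%2d:%2d:%2d %c%2d%2d",
               &out->day, out->month, &out->year,
               &out->hour, &out->minute, &out->second,
               &sign, &zone_hours, &zone_minutes) != 9)
        return 0;
    if (sign != '+' && sign != '-')
        return 0;
    out->gmt_offset_minutes = zone_hours * 60 + zone_minutes;
    if (sign == '-')
        out->gmt_offset_minutes = -out->gmt_offset_minutes;
    return 1;
}
```

The resulting offset could then be matched against the GMT table to find the zone; how that table encodes its GMT column is not specified here, so the mapping is left out.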


OuterView Of Project

[Figure: outer view of the project. WebAdmin is connected to Access_Stats, Host_Stats, Referrer_Stats, and User_Agent_Stats.]

UserAgent Class Diagram

User_Agent
Attributes
Private U_Agent_URL As Character
Private Type As Character
Private Browser As Character
Private Platform As Character
Operations
Public Function Class_Initialize()
Public Function getU_Agent_URL() As Character
Public Sub setU_Agent_URL( val As Character )
Public Function getType() As Character
Public Sub setType( val As Character )
Public Function getBrowser() As Character
Public Sub setBrowser( val As Character )
Public Function getPlatform() As Character
Public Sub setPlatform( val As Character )

U_A_OS and U_A_Browser (both classes have the same members)
Attributes
Private NoOfHits As Integer
Private Bandwidth As Integer
Private NoOfPages As Integer
Operations
Public Function Class_Initialize()
Public Function getNoOfHits() As Integer
Public Sub setNoOfHits( val As Integer )
Public Function getBandwidth() As Integer
Public Sub setBandwidth( val As Integer )
Public Function getNoOfPages() As Integer
Public Sub setNoOfPages( val As Integer )


Access Statistics Class Diagram

ClientRequests
Attributes
Private RequestedFile As Character
Private ReqestedURL As Character
Private RequestedBytes As Character
Private ClientIP As Character
Operations
Public Function getRequestedFile() As Character
Public Sub setRequestedFile( val As Character )
Public Function getReqestedURL() As Character
Public Sub setReqestedURL( val As Character )
Public Function getRequestedBytes() As Character
Public Sub setRequestedBytes( val As Character )
Public Function getClientIP() As Character
Public Sub setClientIP( val As Character )
Public Function Class_Initialize()

By_Pages, By_Files, By_Paths and By_ResponseCode (the four classes have the same members; one of them is marked { From Access_Stats }, and the diagram spells two identifiers inconsistently as NofOfVisitors and NoOFHits)
Attributes
Private NoOfVisitors As Integer
Private Bandwidth As Integer
Private NoOfHits As Integer
Operations
Public Function Class_Initialize()
Public Function getNoOfVisitors() As Integer
Public Sub setNoOfVisitors( val As Integer )
Public Function getBandwidth() As Integer
Public Sub setBandwidth( val As Integer )
Public Function getNoOfHits() As Integer
Public Sub setNoOfHits( val As Integer )


Referrer Statistics Class Diagram

ReferrerStats
Attributes
Private ReferrerURL As Character
Private RefSite As Character
Private Keyword1 As Character
Private Keyword2 As Character
Private Search_Engine As Character
Private Dom_Name As Character
Operations
Public Function Class_Initialize()
Public Function getReferrerURL() As Character
Public Sub setReferrerURL( val As Character )
Public Function getRefSite() As Character
Public Sub setRefSite( val As Character )
Public Function getKeyword1() As Character
Public Sub setKeyword1( val As Character )
Public Function getKeyword2() As Character
Public Sub setKeyword2( val As Character )
Public Function getSearch_Engine() As Character
Public Sub setSearch_Engine( val As Character )
Public Function getDom_Name() As Character
Public Sub setDom_Name( val As Character )

ByRef_Site, By_Keyword and By_SearchEngine (the three classes have the same members)
Attributes
Private NoOfHits As Integer
Private Bandwidth As Integer
Private NoOfPages As Integer
Operations
Public Function Class_Initialize()
Public Function getNoOfHits() As Integer
Public Sub setNoOfHits( val As Integer )
Public Function getBandwidth() As Integer
Public Sub setBandwidth( val As Integer )
Public Function getNoOfPages() As Integer
Public Sub setNoOfPages( val As Integer )


A normal web log is a raw file like the following. The leading 1, 2, 3 and 4 are line numbers added here for reference; they are not part of the log.

1. 65.55.208.12 - - [09/Sep/2007:04:13:04 +0530] "GET /academic/curri2002-ft-welding.doc HTTP/1.0" 200 52224 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
2. 74.6.28.105 - - [09/Sep/2007:04:13:17 +0530] "GET /academic/D508.doc HTTP/1.0" 304 - "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
3. 69.123.246.252 - - [09/Sep/2007:04:13:33 +0530] "GET /images/newlogo.jpg HTTP/1.1" 304 - "http://collinfo.annauniv.edu:6060/annauniv/courseall/branchwise.asp?brname=B.E-Bio-Medical Engineering&brcode=121&degrcode=11" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
4. 69.123.246.252 - - [09/Sep/2007:04:13:33 +0530] "GET /images/annatext.gif HTTP/1.1" 304 - "http://collinfo.annauniv.edu:6060/annauniv/courseall/branchwise.asp?brname=B.E-Bio-Medical Engineering&brcode=121&degrcode=11" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"

Format Of log File:

<ip_addr><base_url>-<date><method><file><protocol><code><bytes><referrer><user_agent>

Fields:
Client IP: 128.101.228.20
Authenticated User ID: - -
Time/Date: [10/Nov/1999:10:16:39 -0600]
Request: "GET / HTTP/1.0" (Other common methods are POST and HEAD)
Status: 200 (200: OK; 3xx: some sort of redirection; 4xx: some sort of client error; 5xx: some sort of server error)
Bytes: -
Referrer: "-"
Agent: "Mozilla/4.61 [en] (WinNT; I)"

Common Log Format:

Remotehost: browser hostname or IP number
Rfc931: remote log name of user (almost always "-", meaning "unknown")
Authuser: authenticated username
Date: date and time of the request
"request": exact request line from the client
Status: the HTTP status code returned
Bytes: the content-length of the response
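
To make the field layout concrete, here is a minimal C sketch that splits one log line of the kind shown earlier into its eleven fields with sscanf. It is an illustration, not the project's parser: struct log_entry and parse_log_line are hypothetical names, the buffer sizes are assumptions loosely based on the Main_Parsed column lengths, and lines with an empty quoted field or a malformed request are simply rejected.

```c
#include <stdio.h>

/* One parsed log line; every field is kept as text. Sizes are
 * illustrative assumptions, each with room for the terminator. */
struct log_entry {
    char client_ip[51];
    char rfc_name[21];
    char log_name[21];
    char log_date[33];   /* text between '[' and ']' */
    char method[21];
    char path[257];
    char protocol[21];
    char status[4];
    char bytes[11];      /* may be "-" */
    char referrer[257];
    char user_agent[201];
};

/* Returns 1 when all eleven fields were read, otherwise 0. */
int parse_log_line(const char *line, struct log_entry *e)
{
    int fields = sscanf(line,
        "%50s %20s %20s [%32[^]]] \"%20s %256s %20[^\"]\" "
        "%3s %10s \"%256[^\"]\" \"%200[^\"]\"",
        e->client_ip, e->rfc_name, e->log_name, e->log_date,
        e->method, e->path, e->protocol,
        e->status, e->bytes, e->referrer, e->user_agent);

    return fields == 11 ? 1 : 0;
}
```

Quoted fields are read with a scanset that stops at the closing quote, so referrers and user agents containing spaces stay in one piece. The width limits keep every read inside its buffer.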


Sample Reports

Access Statistics
Pages
#  Page                        Hits  %      Visitors  %      Bandwidth  %
1  /                           166   27.48  144       26.33  2.30 MB    30.75
2  /coe/schedule.htm           61    10.10  53        9.69   615.87 KB  8.04
3  /result/results_revs.html   32    5.30   31        5.67   117.11 KB  1.53
4  /academic/                  25    4.14   23        4.20   137.72 KB  1.80

Entry Points
#  Entry Point             Hits  %      Visitors  %      Bandwidth  %
1  /                       135   57.45  135       57.45  2.22 MB    86.84
2  /academic/              15    6.38   15        6.38   54.74 KB   2.09
3  /academic               9     3.83   9         3.83   2.85 KB    0.11
4  /academic/lakescr.txt   8     3.40   8         3.40   8          0.00

Paths
#  Path                                            Visitors  %      Bandwidth  %
1  No Referrer -> /                                53        22.55  721.30 KB  9.42
2  No Referrer -> / -> /coe/schedule.htm           16        6.81   549.09 KB  7.17
3  No Referrer -> / -> /result/results_revs.html   11        4.68   249.79 KB  3.26
4  No Referrer -> /academic/                       10        4.26   2.88 KB    0.04

#  File Type  Hits  %      Visitors  %      Pages  %      Bandwidth  %
1  .gif       1616  40.34  173       15.27  196    31.01  3.93 MB    7.08
2  .jpg       653   16.30  177       15.62  63     9.97   4.36 MB    7.86
3  .html      440   10.98  221       19.51  73     11.55  3.63 MB    6.54

#  Response Code              Hits  %      Visitors  %      Pages  %      Bandwidth  %
1  200 - OK                   2415  60.28  240       44.53  566    71.37  44.63 MB   80.48
2  304 - Not Modified         1057  26.39  109       20.22  140    17.65  0          0.00
3  404 - Not Found            411   10.26  120       22.26  46     5.80   119.31 KB  0.21
5  301 - Moved Permanently    22    0.55   22        4.08   6      0.76   6.90 KB    0.01
6  405 - Method Not Allowed   15    0.37   15        2.78   2      0.25   4.78 KB    0.01

Visitor Statistics
Hosts


#  Host              Country  Hits  %     Pages  %     Bandwidth  %
1  122.164.245.135   India    128   3.20  54     1.83  499.78 KB  0.88
2  121.246.25.137    India    121   3.02  86     2.92  352.62 KB  0.62
3  59.92.9.1         India    119   2.97  116    3.93  514.89 KB  0.91

Visitors
#  Visitors          Country  Hits  %     Pages  %     Bandwidth  %
1  122.164.245.135   India    128   3.20  54     1.83  499.78 KB  0.88
2  121.246.25.137    India    121   3.02  86     2.92  352.62 KB  0.62
4  122.164.169.105   India    113   2.82  67     2.27  1.02 MB    1.84

#  Country         Hits  %      Visitors  %      Pages  %      Bandwidth  %
1  India           3882  96.90  278       89.39  599    85.21  45.54 MB   82.11
2  United States   74    1.85   22        7.07   59     8.39   8.36 MB    15.08
3  Kuwait          25    0.62   1         0.32   23     3.27   162.64 KB  0.29

Referrers Statistics
#  Referrer                                     Hits  %      Visitors  %      Pages  %      Bandwidth  %
1  http://www.annauniv.edu /                    1134  28.31  143       15.29  25     2.55   5.95 MB    10.73
2  No Referrer                                  553   13.80  249       26.63  104    10.61  17.25 MB   31.10
3  http://www.annauniv.edu /coe /schedule.htm   457   11.41  59        6.31   19     1.94   2.25 MB    4.05
4  http://www.annauniv.edu /coe /circular.html  197   4.92   19        2.03   18     1.84   1.11 MB    2.00

Referring Sites
#  Referring Site                          Hits  %      Visitors  %      Pages  %      Bandwidth  %
1  http://www.annauniv.edu /               3311  82.65  195       38.09  548    76.97  33.38 MB   60.19
2  No Referrer                             553   13.80  249       48.63  104    14.61  17.25 MB   31.10
3  http://collinfo.annauniv.edu :6060 /    68    1.70   19        3.71   8      1.12   196.97 KB  0.35
4  http://www.google.co.in /               25    0.62   21        4.10   17     2.39   1.98 MB    3.57
5  http://www.google.com /                 13    0.32   8         1.56   6      0.84   663.29 KB  1.17

Keywords
#  Keyword           SE Page  Hits  %      Visitors  %      Pages  %      Bandwidth  %
1  anna university   1        11    28.95  7         22.58  3      13.04  153.23 KB  34.60
2  annauniversity    1        5     13.16  3         9.68   1      4.35   42.43 KB   9.58
3  annauniv.edu      1        4     10.53  3         9.68   2      8.70   76.75 KB   17.33
4  annauniv          1        2     5.26   2         6.45   1      4.35   42.43 KB   9.58

#  Search Engine  SE Page  Hits  %      Visitors  %      Pages  %      Bandwidth  %
1  Google.com     1-2      28    70.00  22        70.97  13     68.42  306.83 KB  68.96
2  Yahoo.com      1        8     20.00  6         19.35  2      10.53  84.86 KB   19.07
3  MSN.com        1        3     7.50   2         6.45   3      15.79  32.00 KB   7.19
4  live.com       1        1     2.50   1         3.23   1      5.26   21.21 KB   4.77

User Agent Stats


#  Operating System  Hits  %      Visitors  %      Pages  %      Bandwidth  %
1  Windows XP        3154  78.99  185       62.08  452    55.73  40.86 MB   83.90
2  Windows 2000      449   11.24  32        10.74  207    25.52  2.74 MB    5.62
3  Windows 98        277   6.94   12        4.03   83     10.23  4.35 MB    8.93
4  Unknown           65    1.63   62        20.81  28     3.45   498.96 KB  1.00
5  Linux             16    0.40   3         1.01   14     1.73   66.85 KB   0.13

#  Browser                  Hits  %      Visitors  %      Pages  %      Bandwidth  %
1  MS Internet Explorer 6   2707  68.85  163       68.49  453    46.89  32.22 MB   66.89
2  Firefox                  597   15.18  30        12.61  245    25.36  4.20 MB    8.71
3  MS Internet Explorer 7   368   9.36   19        7.98   141    14.60  10.20 MB   21.17
4  MS Internet Explorer 5   106   2.70   5         2.10   53     5.49   253.71 KB  0.51
5  Opera 9                  52    1.32   5         2.10   37     3.83   318.61 KB  0.65

Error Stats
Errors
#  Error                                      Referrer                                         Hits  %
1  /coe/TITLEflowers.gif                      http://www.annauniv.edu /coe /schedule.htm       97    22.77
2  /favicon.ico                               No Referrer                                      87    20.42
3  /coe/fd_1.jpg                              http://www.annauniv.edu /coe /top.htm            35    8.22
4  /campustour/images/leftboxcorner_top.gif   http://www.annauniv.edu /campustour /index.htm   27    6.34
5  /academic/                                 No Referrer                                      15    3.52

#  Error                      Hits  %
1  404 - Not Found            411   96.48
2  405 - Method Not Allowed   15    3.52
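
The % columns in these reports are consistent with a simple share of the column total. In the error table above, the two rows sum to 426 hits, and 411 of 426 is 96.48 percent. A minimal C sketch of that calculation follows; the function name percent_share is hypothetical, and returning 0 for an empty total is an assumption made to avoid dividing by zero.

```c
/* Returns count as a percentage of total. A total of zero or less
 * yields 0.0 instead of dividing by zero (an assumed convention). */
double percent_share(long count, long total)
{
    if (total <= 0)
        return 0.0;
    return 100.0 * (double)count / (double)total;
}
```

Printing the result with a "%.2f" format gives the two-decimal figures used in the tables.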

Sample Code

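#define _GNU_SOURCE     /* for getline() */
#include <assert.h>
#include <glob.h>       /* glob(), globfree(), glob_t */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>     /* getopt(), optarg, optind */

/* The project headers that declare struct log_entry_filter, parse_line(),
   filter_entry(), exit_with_diagnostic(), case_sensitive and the other
   helpers used below are not shown in this excerpt. */

#ifndef _DEBUG
#define PRIVATE static
#else
#define PRIVATE
#endif

#define MAX_FILE_SPECS (10)
#define INITIAL_BUFFER_LEN (100)

PRIVATE struct log_entry_filter log_filter;
PRIVATE char* file_specs[MAX_FILE_SPECS];   /* NULL-terminated list */

PRIVATE void filter_file(FILE* log_file);
PRIVATE void parse_command_line(int argc, char** argv);
PRIVATE void execute_all_tests(void);
PRIVATE char* all_tests(void);
PRIVATE void read_file_specs_from_cl(int argc, char* argv[]);
PRIVATE void filter_files(glob_t* glob);
PRIVATE void free_file_specs(void);
PRIVATE void print_version(void);
PRIVATE void filter_file_specs(void);
PRIVATE void usage(void);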

int main(int argc, char** argv)
{
    parse_command_line(argc, argv);
    if (file_specs[0] != NULL)
    {
        /* Filter every log file named on the command line. */
        filter_file_specs();
        free_file_specs();
    }
    else
    {
        /* No files given: filter standard input. */
        filter_file(stdin);
    }
    filter_free(&log_filter);
    return EXIT_SUCCESS;
}

PRIVATE
void
filter_file(FILE* log_file)
{
    struct log_file_entry* entry;
    char* line = NULL;
    size_t length = INITIAL_BUFFER_LEN;

    line = buffer_allocate(line, INITIAL_BUFFER_LEN + 1);

    /* getline() grows the buffer as needed and updates length. */
    while (getline(&line, &length, log_file) != -1)
    {
        assert(line != NULL);
        entry = parse_line(line);
        if (entry)
        {
            /* Echo only the lines that pass the filter. */
            if (filter_entry(&log_filter, entry))
            {
                fputs(line, stdout);
            }
            free_entry(entry);
        }
    }
    free(line);
}

PRIVATE
void
free_file_specs(void)
{
    int counter = 0;

    /* Stop at the NULL terminator, and never run past the array. */
    while (counter < MAX_FILE_SPECS && file_specs[counter] != NULL)
    {
        free(file_specs[counter]);
        counter++;
    }
}

PRIVATE
void
filter_file_specs(void)
{
    int counter = 0;
    int flags = 0;
    int status;
    glob_t glob_buf;

    assert(file_specs[0] != NULL);

    while (counter < MAX_FILE_SPECS && file_specs[counter] != NULL)
    {
        status = glob(file_specs[counter], flags, NULL, &glob_buf);
        switch (status)
        {
        case GLOB_NOSPACE:
            // Out of memory error
            exit_with_diagnostic("Ran out of memory whilst globbing...\n");
            break;

        case GLOB_ABORTED:
            // A read error occurred while scanning directories
            exit_with_diagnostic("Read error whilst globbing...\n");
            break;

        case GLOB_NOMATCH:
            // The pattern didn't match any files
            exit_with_diagnostic("No files match file spec\n");
            break;

        default:
            // Everything went ok, just carry on...
            break;
        }
        // Later patterns are appended to the results of the first one
        flags |= GLOB_APPEND;
        counter++;
    }
    assert(glob_buf.gl_pathc > 0);
    filter_files(&glob_buf);
    globfree(&glob_buf);
}

PRIVATE
void
filter_files(glob_t* glob)
{
    size_t i;   /* gl_pathc is a size_t, so avoid a signed comparison */
    FILE* log_file;

    for (i = 0; i < glob->gl_pathc; ++i)
    {
        log_file = fopen(glob->gl_pathv[i], "r");
        if (!log_file)
        {
            exit_with_diagnostic("Unable to open log file\n");
        }
        filter_file(log_file);
        fclose(log_file);
    }
}

PRIVATE
void
usage(void)
{
    exit_with_diagnostic(
        "usage: " PACKAGE_NAME " [-hiTv] [-b browser] [-c client] [-f filter(s)]\n"
        " [-I identity] [-m method] [-p protocol] [-r referer] [-s status]\n"
        " [-u uri] [-U user] [-z size] logfile [logfile...]\n"
        "\n"
        " -b browser filter for user agent (browser) string\n"
        " -c client filter for client address\n"
        " -h get usage message\n"
        " -i do case-insensitive string searches\n"
        " -I identity ?? filter on second field of log file\n"
        " -m method filter on request method (e.g. GET, POST...)\n"
        " -p protocol filter on HTTP protocol version field (e.g. HTTP/1.1)\n"
        " -r referer filter on document referer string\n"
        " -s status filter on request status value (e.g. 200, 404...)\n"
        " -T run internal test suite\n"
        " -u uri filter on document URI\n"
        " -U user filter on user name used in request, if any\n"
        " -v show program's version number\n"
        " -z size filter on document size\n"
        "\n");
}

PRIVATE
void
parse_command_line(int argc, char** argv)
{
    int choice;

    if (argc <= 1)
    {
        usage();
    }
    memset(file_specs, 0, MAX_FILE_SPECS * sizeof(char*));
    /* The option string lists only the options handled below. */
    while (((choice = getopt(argc, argv, "b:c:hiTI:m:p:r:s:u:U:vz:")) != -1))
    {
        switch (choice)
        {
        case 'b':
            save_ua_filter(&log_filter, optarg);
            break;

        case 'c':
            save_client_filter(&log_filter, optarg);
            break;

        case 'h':
            usage();
            break;

        case 'i':
            // Perform case insensitive matches
            case_sensitive = 0;
            break;

        case 'I':
            save_identity_filter(&log_filter, optarg);
            break;

        case 'm':
            save_method_filter(&log_filter, optarg);
            break;

        case 'p':
            save_protocol_filter(&log_filter, optarg);
            break;

        case 'r':
            save_referer_filter(&log_filter, optarg);
            break;

        case 's':
            save_status_filter(&log_filter, optarg);
            break;

        case 'T':
            execute_all_tests();
            break;

        case 'u':
            save_uri_filter(&log_filter, optarg);
            break;

        case 'U':
            save_user_id_filter(&log_filter, optarg);
            break;

        case 'v':
            print_version();
            break;

        case 'z':
            save_size_filter(&log_filter, optarg);
            break;

        default:
            usage();
            exit_with_diagnostic("\nUnknown command line option");
            break;
        }
    }
    /* Whatever follows the options is treated as log file specs. */
    read_file_specs_from_cl(argc, argv);
}

PRIVATE
void
read_file_specs_from_cl(int argc, char* argv[])
{
    int cl_counter;
    int file_spec_counter = 0;
    char* file_spec;

    assert(file_specs[0] == NULL);

    // Everything after the options is treated as a log file spec
    for (cl_counter = optind; cl_counter < argc; ++cl_counter)
    {
        // Guard against writing past the end of file_specs
        if (file_spec_counter >= MAX_FILE_SPECS)
        {
            exit_with_diagnostic("Too many file specs on command line\n");
        }
        file_spec = malloc(strlen(argv[cl_counter]) + 1);
        if (!file_spec)
        {
            exit_with_diagnostic("Failed to allocate buffer for file spec\n");
        }
        strcpy(file_spec, argv[cl_counter]);
        file_specs[file_spec_counter++] = file_spec;
    }
    // At least one log file is required
    if (file_spec_counter == 0)
    {
        usage();
    }
}

PRIVATE
void
print_version(void)
{
    printf("%s version %s\n", PACKAGE_NAME, VERSION);
    exit(EXIT_SUCCESS);
}

PRIVATE
char*
all_tests(void)
{
    mu_run_test(entry_all_tests);
    mu_run_test(filter_all_tests);
    return 0;
}

PRIVATE
void
execute_all_tests(void)
{
    int exit_code = EXIT_SUCCESS;
    char *result;

    result = all_tests();

    if (result != 0)
    {
        printf("%s\n", result);
        exit_code = EXIT_FAILURE;
    }
    else
    {
        printf("ALL TESTS PASSED\n");
    }
    printf("Tests run: %d\n", tests_run);
    exit(exit_code);
}
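
The filter options printed by usage() line up with the fields of a web server access log entry. As an illustration only, the sketch below splits one line into those fields, assuming the logs are in Apache Combined Log Format. The names log_entry_sketch and parse_log_line_sketch are hypothetical and are not part of the project code, and the buffer sizes are arbitrary choices.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record type: one member per log field, with the
 * command line option that would filter on it noted alongside. */
typedef struct
{
    char client[64];      /* -c client address            */
    char identity[64];    /* -I identity (second field)   */
    char user[64];        /* -U user name                 */
    char timestamp[64];   /* text between [ and ]         */
    char method[16];      /* -m request method            */
    char uri[512];        /* -u document URI              */
    char protocol[16];    /* -p protocol version          */
    int  status;          /* -s request status            */
    char size[16];        /* -z size, "-" if none sent    */
    char referer[512];    /* -r referer                   */
    char user_agent[512]; /* -b browser string            */
} log_entry_sketch;

/* Assumes Combined Log Format. Returns 1 when all eleven fields were
 * read, 0 otherwise. Limitation of this sketch: an empty quoted field
 * ("") or a malformed request line makes the parse fail. */
int
parse_log_line_sketch(const char* line, log_entry_sketch* e)
{
    int n;

    assert(line != NULL && e != NULL);
    memset(e, 0, sizeof *e);
    n = sscanf(line,
               "%63s %63s %63s [%63[^]]] \"%15s %511s %15[^\"]\" "
               "%d %15s \"%511[^\"]\" \"%511[^\"]\"",
               e->client, e->identity, e->user, e->timestamp,
               e->method, e->uri, e->protocol,
               &e->status, e->size, e->referer, e->user_agent);
    return n == 11;
}
```

With an entry parsed this way, an option such as -s 404 would amount to comparing the status member against the requested value, and -i would switch the string comparisons to a case-insensitive variant.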

References:

[1] M. Vranic, D. Pintar, Z. Skocir; "The Use of Data Mining in Education
Environment" in 9th International Conference on Telecommunications
(ConTel 2007); June 2007; pp. 243-250
[2] Qianhui Althea Liang, Jen-Yao Chung, Steven Miller, Yang Ouyang;
"Service Pattern Discovery of Web Service Mining in Web Service
Registry-Repository" in IEEE International Conference on e-Business
Engineering (ICEBE'06); October 2006
[3] Georgios Lappas; "An Overview of Web Mining in Societal Benefit Areas" in
The 9th IEEE International Conference on E-Commerce Technology and The
4th IEEE International Conference on Enterprise Computing, E-Commerce
and E-Services (CEC-EEE 2007); July 2007; pp. 683-690
[4] Hafidh Ba-Omar, Ilias Petrounias, Fahad Anwar; "A Framework for Using
Web Usage Mining to Personalise E-learning" in Seventh IEEE International
Conference on Advanced Learning Technologies (ICALT 2007); July 2007;
pp. 937-938
[5] Leticia dos Santos Machado, Karin Becker; "Distance Education: A Web
Usage Mining Case Study for the Evaluation of Learning Sites" in Third IEEE
International Conference on Advanced Learning Technologies (ICALT'03);
July 2003; pp. 360
[6] Carlos G. Marquardt, Karin Becker, Duncan D. Ruiz; "A Pre-Processing Tool
for Web Usage Mining in the Distance Education Domain" in International
Database Engineering and Applications Symposium (IDEAS'04); July 2004;
pp. 78-87
[7] Xiangzhu Gao, San Murugesan, Bruce Lo; "Extraction of Keyterms by Simple
Text Mining for Business Information Retrieval" in IEEE International
Conference on e-Business Engineering (ICEBE'05); October 2005;
pp. 332-339
[8] Ajith Abraham; "Natural Computation for Business Intelligence from Web
Usage Mining" in Seventh International Symposium on Symbolic and Numeric
Algorithms for Scientific Computing (SYNASC'05); September 2005; pp. 3-10
