
Web Mining
Subject Code: CSE3024
L,T,P,J,C: 3,0,2,0,4
v.1.1
Objectives
• To give a detailed overview of the web mining process and its techniques
• To understand the basics of web search, with special emphasis on web
crawling
• To understand the basics of indexing and the various types of query
processing approaches
• To appreciate the use of machine learning approaches for web content
mining
• To understand the role of hyperlinks in web structure mining
• To appreciate the various aspects of web usage mining

Expected Outcome
Upon completion of the course, the students will be able to:
• Build a sample search engine using available open source tools
• Describe the browser security model in web security
• Identify the different components of a web page that can be used for
mining
• Apply machine learning concepts to web content mining
• Implement the PageRank algorithm and modify the algorithm for mining
information
• Design a system to harvest information available on the web to build
recommender systems
• Analyse social media data using appropriate data/web mining techniques
• Modify an existing search engine to make it personalized

Module / Topics / L Hrs / SLO

1 INTRODUCTION (L Hrs: 5, SLO: 2)
Introduction of WWW - Architecture of the WWW - Web Document
Representation - Web Search Engine - Challenges - Web security
overview and concepts - Web application security - Basic web security
model - Web Hacking Basics: HTTP & HTTPS URL, Web Under the Cover,
Overview of Java security, Reading the HTML source.

2 WEB CRAWLING (L Hrs: 5, SLO: 7, 1)
Basic Crawler Algorithm: Breadth-First/Depth-First Crawlers -
Universal Crawlers - Preferential Crawlers: Focused Crawlers - Topical
Crawlers.

3 INDEXING (L Hrs: 5, SLO: 2)
Static and Dynamic Inverted Index - Index Construction and Index
Compression - Latent Semantic Indexing. Searching using an Inverted
Index: Sequential Search - Pattern Matching - Similarity Search.

4 WEB STRUCTURE MINING (L Hrs: 8, SLO: 7, 1)
Link Analysis - Social Network Analysis - Co-Citation and
Bibliographic Coupling - PageRank - Weighted PageRank - HITS -
Community Discovery - Web Graph Measurement and Modelling -
Using Link Information for Web Page Classification.

5 WEB CONTENT MINING (L Hrs: 8, SLO: 7, 1)
Classification: Decision Tree for Text Documents - Naive Bayesian Text
Classification - Ensemble of Classifiers. Clustering: K-means
Clustering - Hierarchical Clustering - Markov Models - Probability-
Based Clustering. Vector Space Model - Latent Semantic Indexing -
Automatic Topic Extraction from Web Documents.

6 WEB USAGE MINING (L Hrs: 9, SLO: 7, 1)
Web Usage Mining - Clickstream Analysis - Log Files - Data
Collection and Pre-Processing - Data Modelling for Web Usage Mining
- The BIRCH Clustering Algorithm - Modelling web user interests
using clustering - Affinity Analysis and the Apriori Algorithm -
Binning - Web usage mining using Probabilistic Latent Semantic
Analysis - Finding User Access Patterns via the Latent Dirichlet
Allocation Model.

7 QUERY PROCESSING (L Hrs: 3, SLO: 11)
Relevance Feedback and Query Expansion - Automatic Local and
Global Analysis - Measuring Effectiveness and Efficiency.

8 RECENT TRENDS (L Hrs: 2)

Lab (Indicative List of Experiments) 60

1. Develop a search engine for the retrieval process
Develop a search engine that crawls, transforms and indexes information for
retrieval and presentation in response to user queries.
2. Develop a crawler based on domains
Develop web crawlers that copy all the pages they visit for later
processing by a search engine, which indexes the downloaded pages so that users can
search them much more efficiently.
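
As a starting point for this experiment, the sketch below shows the breadth-first crawling order only. It is a minimal illustration under stated assumptions: `crawl_bfs`, `fetch_links` and the in-memory `TOY_WEB` graph are hypothetical names introduced here, and no real HTTP requests are made. A real crawler would download each page, extract its links, restrict itself to the chosen domains and respect the site's crawling rules.

```python
from collections import deque

def crawl_bfs(seed, fetch_links, max_pages=100):
    """Visit pages breadth-first from `seed`; return URLs in visit order."""
    frontier = deque([seed])   # FIFO queue gives breadth-first order
    seen = {seed}              # URLs already queued, so none is visited twice
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)    # a real crawler would store the page here
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

# Hypothetical in-memory "web": page -> outgoing links (an assumption
# standing in for downloading a page and parsing its hyperlinks).
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "a"],
    "d": [],
}

order = crawl_bfs("a", lambda url: TOY_WEB.get(url, []))
print(order)
```

Replacing `frontier.popleft()` with `frontier.pop()` turns the queue into a stack and gives a depth-first crawler instead.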
3. Extract textual information and multimedia content from documents
Efficiently extract the related textual information and multimedia content from
documents using web content, web structure and web usage mining.

4. Develop search engine indexing
Indexing helps to optimize speed and performance in finding
relevant documents for a search query. Without an index, the search engine would
scan every document in the corpus, which would require considerable time and
computing power.
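
The idea can be illustrated with a minimal inverted index, sketched below under simplifying assumptions: tokens are lower-cased and split on whitespace, and queries use AND semantics. The names `build_inverted_index`, `search` and the sample `DOCS` are hypothetical and introduced only for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))  # copy, so the index is not mutated
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical sample corpus: document id -> text
DOCS = {
    1: "web mining and web search",
    2: "search engine indexing",
    3: "mining usage logs",
}
INDEX = build_inverted_index(DOCS)
print(search(INDEX, "web search"))
```

A query now touches only the posting sets of its own terms instead of scanning every document in the corpus.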
5. Increase the efficiency of sentiment analysis and opinion mining
Sentiment analysis, also known as opinion mining, is the process of determining
whether a piece of writing is positive, negative or neutral. It aims to determine the
attitude of a speaker, writer, or other subject with respect to some topic, or the
overall contextual polarity or emotional reaction to a document, interaction, or event.
6. Implement a recommendation system
A recommender system, or recommendation system, seeks to predict
the "rating" or "preference" that a user would give to an item. Recommender systems
are used in a variety of areas, such as movies, music, news, books, research articles,
search queries, social tags, and products in general.
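
One common way to predict such a rating is user-based collaborative filtering. The sketch below is one possible approach, not a prescribed method for this experiment: it predicts a rating as the similarity-weighted average of other users' ratings for the item, using cosine similarity over co-rated items. The `RATINGS` data and the function names are hypothetical.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating}
RATINGS = {
    "ann": {"m1": 5, "m2": 3, "m3": 4},
    "bob": {"m1": 4, "m2": 2, "m3": 5, "m4": 4},
    "cat": {"m1": 1, "m2": 5, "m4": 2},
}

def cosine(u, v):
    """Cosine similarity of two users over the items both have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def predict(ratings, user, item):
    """Similarity-weighted average of other users' ratings for `item`.

    Returns None when no other user has rated the item.
    """
    numerator = denominator = 0.0
    for other, their_ratings in ratings.items():
        if other == user or item not in their_ratings:
            continue
        similarity = cosine(ratings[user], their_ratings)
        numerator += similarity * their_ratings[item]
        denominator += similarity
    return numerator / denominator if denominator else None

prediction = predict(RATINGS, "ann", "m4")
print(prediction)
```

In this toy data "ann" rates more like "bob" than like "cat", so the prediction for "m4" is pulled towards bob's rating.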

7. Implement effective compression schemes for storing the data using
less storage space
A search engine indexes every document in the corpus. The resulting
index should be compressed effectively so that it occupies less storage space.
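
One widely used scheme for compressing postings lists is to store the gaps between sorted document ids and encode each gap with variable-byte encoding. The sketch below assumes that scheme; other schemes would also satisfy this experiment, and the function names are hypothetical.

```python
def vb_encode_number(n):
    """Variable-byte encode one non-negative integer.

    Each byte carries 7 payload bits; the high bit is set on the last byte.
    """
    out = [n % 128]
    n //= 128
    while n > 0:
        out.insert(0, n % 128)
        n //= 128
    out[-1] += 128  # mark the final byte of this number
    return bytes(out)

def encode_postings(doc_ids):
    """Encode a sorted list of document ids as variable-byte encoded gaps."""
    encoded = bytearray()
    previous = 0
    for doc_id in doc_ids:
        encoded += vb_encode_number(doc_id - previous)
        previous = doc_id
    return bytes(encoded)

def decode_postings(data):
    """Invert encode_postings: rebuild the sorted list of document ids."""
    doc_ids = []
    gap = 0
    previous = 0
    for byte in data:
        if byte < 128:            # continuation byte
            gap = gap * 128 + byte
        else:                     # final byte of the current gap
            gap = gap * 128 + (byte - 128)
            previous += gap
            doc_ids.append(previous)
            gap = 0
    return doc_ids

postings = [824, 829, 215406]
compressed = encode_postings(postings)
print(len(compressed), decode_postings(compressed))
```

Small gaps fit in a single byte, so frequent terms with densely packed postings compress best.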
8. Develop an effective query refinement mechanism based on query
algebra
Query expansion (QE), or query refinement, is the process of reformulating a
seed query to improve retrieval performance. In the context of search engines, query
expansion involves evaluating a user's input (the words typed into the search
query area) and expanding the search query to match additional documents.
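
A very simple form of query expansion adds related terms from a thesaurus to the seed query. The sketch below assumes a small hand-written synonym table (`SYNONYMS`, hypothetical); it does not implement query algebra, and a real system might instead derive related terms from relevance feedback or corpus statistics.

```python
# Hypothetical synonym table: term -> related terms
SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "cheap": ["affordable"],
}

def expand_query(query, synonyms):
    """Return the query terms plus their synonyms, in order, without duplicates."""
    expanded = []
    for term in query.lower().split():
        for candidate in [term] + synonyms.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

expanded_terms = expand_query("cheap car", SYNONYMS)
print(expanded_terms)
```

The expanded term list can then be matched against the index with OR semantics, so that documents using a synonym of the original word are also retrieved.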
9. Personalize the search engine
A web search engine is a software system that is designed to search for
information on the World Wide Web. Personalize the search engine for a particular
audience or content type, for example for kids, or to list only research articles
or images.

10. Personalized web search
Personalize web search using logged user search behavior as context:
user ids, queries, query terms, URLs, URL domains and clicks.

11. Consumer products
Identify product mentions within a largely user-generated web-based corpus and
disambiguate the mentions against a large product catalog using blogs, forums,
product review sites, and e-commerce merchants.

12. Large Scale Hierarchical Text Classification

Hierarchies are becoming ever more popular for the organization of text
documents, particularly on the Web. Web directories and Wikipedia are two
examples of such hierarchies. Along with their widespread use comes the need for
automated classification of new documents to the categories in the hierarchy. As the
size of the hierarchy grows and the number of documents to be classified increases, a
number of interesting machine learning problems arise. In particular, it is one of the
rare situations where data sparsity remains an issue, despite the vastness of available
data: as more documents become available, more classes are also added to the
hierarchy, and there is a very high imbalance between the classes at different levels
of the hierarchy.
13. Company Web

Given the data related to current employees and their provisioned access, models can
be built that automatically determine access privileges as employees enter and leave
roles within a company. These auto-access models seek to minimize the human
involvement required to grant or revoke employee access. The model will take an
employee's role information and a resource code and will return whether or not
access should be granted.

List of Case Studies:


1. Market-customer analysis
2. Biological/ DNA sequence analysis
3. Detecting software bugs
4. Improving storage performance
5. Design of structured pattern mining methods
6. Network alarm pattern mining
7. XML query access pattern analysis
8. System performance
9. Telecommunication network
10. Financial and Scientific data
11. Creating adaptive web sites
12. System improvement
13. Weblog navigation patterns

Text Books
1. Bing Liu, “Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data (Data-Centric
Systems and Applications)”, Springer, 2nd Edition, 2010.
2. Zdravko Markov, Daniel T. Larose, “Data Mining the Web: Uncovering Patterns in Web Content,
Structure, and Usage”, John Wiley & Sons, Inc., 2012.

Reference Books
1. Guandong Xu, Yanchun Zhang, Lin Li, “Web Mining and Social Networking: Techniques and
Applications”, Springer, 1st Edition, 2010.
2. Soumen Chakrabarti, “Mining the Web: Discovering Knowledge from Hypertext Data”, Morgan
Kaufmann, 2012 edition.
3. Adam Schenker, “Graph-Theoretic Techniques for Web Content Mining”, World Scientific Pub Co
Inc., 2015.
4. Min Song, Yi Fang and Brook Wu, “Handbook of Research on Text and Web Mining Technologies”,
IGI Global, Information Science Reference (an imprint of IGI Publishing), 2011.
Web Mining
Knowledge Areas that contain topics and learning outcomes covered in the course

Knowledge Area Total Hours of Coverage

CS: IAS(Information Assurance and Security) 5

CS: IM(Information Management) 13

CS: Intelligent Systems (IS) 27

Body of Knowledge coverage


[List the Knowledge Units covered in whole or in part in the course. If in part, please indicate
which topics and/or learning outcomes are covered. For those not covered, you might want to
indicate whether they are covered in another course or not covered in your curriculum at all.
This section will likely be the most time-consuming to complete, but is the most valuable for
educators planning to adopt the CS2013 guidelines.]

KA / Knowledge Unit / Topics Covered / Hours

CS: IAS - IAS/Web Security (Hours: 5)
• Web security model and its applications
• Browser security model
• HTTP security extensions

CS: IM - IM/Information Management Concepts (Hours: 4)
• Basic information storage and retrieval (IS&R) concepts
• Information capture and representation
• Supporting human needs: searching, retrieving,
linking, browsing, navigating
• Analysis and indexing

CS: IM - IM/Indexing (Hours: 6)
• The impact of indices on query performance
• The basic structure of an index
• Indexing text
• Indexing the web (e.g., web crawling)

CS: IS - IS/Basic Search Strategies (Hours: 3)
• Uninformed search (breadth-first, depth-first,
depth-first with iterative deepening)
• Heuristics and informed search

CS: IS - IS/Basic Machine Learning (Hours: 23)
• Definition and examples of a broad variety of
machine learning tasks, including classification
• Inductive learning
• Simple statistical-based learning, such as Naive
Bayesian Classifier, decision trees
• The over-fitting problem
• Measuring classifier accuracy

CS: IS - IS/Advanced Machine Learning (Hours: 4)
• Learning graphical models (Cross-reference
IS/Reasoning under Uncertainty)

Total hours 45
Where does the course fit in the curriculum?
[In what year do students commonly take the course? Is it compulsory? Does it have pre-
requisites, required following courses? How many students take it?]

This course is:
• An elective course.
• Suitable from the 4th semester onwards.
• Knowledge of basic mathematics is essential.

What is covered in the course?


[A short description, and/or a concise list of topics - possibly from your course syllabus.(This is
likely to be your longest answer)]

Part I: Introduction to Web Mining
This part introduces web mining, its architecture and challenges, and security on the web.

Part II: Web Crawling and Indexing
This part covers how to fetch and store data from the web using recent algorithms.

Part III: Three categories of web mining
This part explains the three categories of web mining, each illustrated with recent
algorithms.

What is the format of the course?


[Is it face to face, online or blended? How many contact hours? Does it have lectures, lab
sessions, discussion classes?]

This course is designed with 100 minutes of in-classroom sessions per week, 60 minutes of
video/reading instructional material per week, 100 minutes of lab hours per week, as well as
200 minutes of non-contact time spent on implementing a course-related project. In general, the
course should combine lectures, in-class discussion, case studies, guest lectures,
mandatory off-class reading material, and quizzes.

How are students assessed?


[What type, and number, of assignments are students expected to do? (papers, problem sets,
programming projects, etc.). How long do you expect students to spend on completing assessed
work?]

• Students are assessed on a combination of group activities, classroom discussion, projects,
and continuous and final assessment tests.

• Additional weightage will be given based on their rank in crowd-sourced projects or Kaggle-like
competitions.

• Students can earn additional weightage based on a certificate of completion of a related
MOOC course.

Additional topics
[List notable topics covered in the course that you do not find in the CS2013 Body of
Knowledge]

Other comments
[optional]
Session wise plan
Student Outcomes Covered: 2, 11, 14, 17

Class Hour | Lab Hour | Topic Covered | Levels of mastery | Reference Book | Remarks

2 | | Introduction and Architecture of the WWW | Familiarity | 1 |
1 | | Web Document Representation - Web Search Engine - Challenges | Usage | 1 |
1 | | Web security overview and concepts, Web application security, Basic web security model | Familiarity | 1 |
1 | | Web Hacking Basics: HTTP & HTTPS URL, Web Under the Cover, Overview of Java security, Reading the HTML source | Familiarity | 1 |
2 | | Basic Crawler Algorithm: Breadth-First/Depth-First Crawlers | Usage | 1,2 |
1 | | Universal Crawlers | Usage | 1,2 |
2 | | Preferential Crawlers: Focused Crawlers - Topical Crawlers | Usage | 1,2 |
3 | | Static and Dynamic Inverted Index - Index Construction and Index Compression - Latent Semantic Indexing | Familiarity | 1 |
2 | | Searching using an Inverted Index: Sequential Search - Pattern Matching - Similarity Search | Usage | 1,2 |
3 | | Link Analysis - Social Network Analysis - Co-Citation and Bibliographic Coupling | Familiarity | 1,2 |
3 | | PageRank - Weighted PageRank | Usage | 1,2 |
2 | | Community Discovery - Web Graph Measurement and Modelling - Using Link Information for Web Page Classification | Familiarity | 1,2 |
3 | | Classification: Decision Tree for Text Documents - Naive Bayesian Text Classification - Ensemble of Classifiers | Assessment | 1,2,3 |
3 | | Clustering: K-means Clustering - Hierarchical Clustering - Markov Models - Probability-Based Clustering | Assessment | 1,2 |
2 | | Vector Space Model - Latent Semantic Indexing - Automatic Topic Extraction from Web Documents | Usage | 1 |
2 | | Web Usage Mining - Clickstream Analysis - Web Server Log Files - Data Collection and Pre-Processing - Data Modelling for Web Usage Mining | Usage | 1,2 |
4 | | The BIRCH Clustering Algorithm - Modelling web user interests using clustering - Affinity Analysis and the Apriori Algorithm - Binning | Usage | 1,2 |
2 | | Web usage mining using Probabilistic Latent Semantic Analysis - Finding User Access Patterns via the Latent Dirichlet Allocation Model | Usage | 1,2 |
2 | | Relevance Feedback and Query Expansion - Automatic Local and Global Analysis | Usage | 1,2 |
2 | | Application | Assessment | |

45 Hours (3 credit hours/week x 15 weeks schedule)

Approved by Academic Council No.:47 Date: 05.10.2017
