
Machine Learning Methods

for
Text / Web Data Mining
Byoung-Tak Zhang
School of Computer Science and Engineering
Seoul National University
E-mail: btzhang@cse.snu.ac.kr
This material is available at
http://scai.snu.ac.kr/~btzhang/

Overview

• Introduction
  - Web Information Retrieval
  - Machine Learning (ML)
  - ML Methods for Text/Web Data Mining
• Text/Web Data Analysis
  - Text Mining Using Helmholtz Machines
  - Web Mining Using Bayesian Networks
• Summary
  - Current and Future Work

Web Information Retrieval

[Diagram: a text-data pipeline. Text data flows through preprocessing and
indexing into three tasks: text classification (classification system);
information filtering (information filtering system, driven by a user
profile with question/answer feedback, producing filtered data); and
information extraction (DB template filling & information extraction
system, producing DB records with fields such as Location and Date).]

Machine Learning

• Supervised Learning
  - Estimate an unknown mapping from known input-output pairs
  - Learn f_w from a training set D = {(x, y)} s.t. f_w(x) = y = f(x)
  - Classification: y is discrete
  - Regression: y is continuous
• Unsupervised Learning (contrasted with the supervised case in the sketch below)
  - Only input values are provided
  - Learn f_w from D = {(x)} s.t. f_w(x) = x
  - Density estimation
  - Compression, clustering
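A minimal sketch of the two settings in Python (the dataset and model choices are illustrative assumptions, not from the talk):

```python
# Supervised vs. unsupervised learning on the same toy data.
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = make_blobs(n_samples=200, centers=3, random_state=0)

# Supervised: learn f_w from (x, y) pairs so that f_w(x) approximates y.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("supervised accuracy:", clf.score(X, y))

# Unsupervised: only inputs are given; clustering recovers structure in x.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster assignments:", km.labels_[:10])
```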

Machine Learning Methods

• Neural Networks
  - Multilayer Perceptrons (MLPs)
  - Self-Organizing Maps (SOMs)
  - Support Vector Machines (SVMs)
• Probabilistic Models
  - Bayesian Networks (BNs)
  - Helmholtz Machines (HMs)
  - Latent Variable Models (LVMs)
• Other Machine Learning Methods
  - Evolutionary Algorithms (EAs)
  - Reinforcement Learning (RL)
  - Boosting Algorithms
  - Decision Trees (DTs)

ML for Text/Web Data Mining

• Bayesian Networks for Text Classification
• Helmholtz Machines for Text Clustering/Categorization
• Latent Variable Models for Topic Word Extraction
• Boosted Learning for the TREC Filtering Task
• Evolutionary Learning for Web Document Retrieval
• Reinforcement Learning for Web Filtering Agents
• Bayesian Networks for Web Customer Data Mining

Preprocessing for Text Learning

Example: a Usenet message is converted into a word-count vector.

  From: xxx@sciences.sdsu.edu
  Newsgroups: comp.graphics
  Subject: Need specs on Apple QT

  I need the specs, or at least a very verbose interpretation of the
  specs, for QuickTime. Technical articles from magazines and
  references to books would be nice, too.
  I also need the specs in a format usable on a Unix or MS-DOS
  system. I can't do much with the QuickTime stuff they have on...

  → baseball: 0, car: 0, clinton: 0, computer: 0, graphics: 0,
    hockey: 0, quicktime: 2, ..., references: 1, space: 0, specs: 3,
    unix: 1, ...
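A bag-of-words preprocessing sketch (the message text and vocabulary below are stand-ins for the slide's example):

```python
# Convert a document into a word-count vector over a fixed vocabulary.
from sklearn.feature_extraction.text import CountVectorizer

doc = ("I need the specs, or at least a very verbose interpretation of "
       "the specs, for QuickTime. References to books would be nice. "
       "I also need the specs in a format usable on a Unix system.")

vocab = ["baseball", "car", "clinton", "computer", "graphics",
         "hockey", "quicktime", "references", "space", "specs", "unix"]
vectorizer = CountVectorizer(vocabulary=vocab)
counts = vectorizer.transform([doc]).toarray()[0]
for word, count in zip(vocab, counts):
    print(f"{count} {word}")   # e.g., "3 specs", "1 unix"
```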

Text Mining: Data Sets

• Usenet Newsgroup Data
  - 20 categories
  - 1,000 documents for each category
  - 20,000 documents in total
• TDT2 Corpus
  - Topic Detection and Tracking (TDT): NIST
  - Used 6,169 documents in experiments

Text Mining: Helmholtz Machine Architecture

[Diagram: latent nodes h_1, h_2, ..., h_m fully connected to input nodes
d_1, d_2, ..., d_n; recognition weights run from the inputs to the latent
nodes, generative weights from the latent nodes back to the inputs.]

  P(h_i = 1) = 1 / (1 + exp(-b_i - Σ_{j=1}^{n} w_ij d_j))   (recognition)
  P(d_i = 1) = 1 / (1 + exp(-b_i - Σ_{j=1}^{m} w_ij h_j))   (generative)

• Latent nodes
  - Binary values
  - Extract the underlying causal structure in the document set
  - Capture correlations of the words in documents
• Input nodes
  - Binary values
  - Represent the presence or absence of words in documents

[Chang and Zhang, 2000]
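A sketch of the stochastic binary units above, assuming the sigmoid parameterization shown (toy sizes and random weights):

```python
# One recognition pass and one generative pass through a tiny Helmholtz machine.
import numpy as np

def unit_prob(b, W, inputs):
    """P(unit_i = 1) = sigmoid(b_i + sum_j W[i, j] * inputs[j])."""
    return 1.0 / (1.0 + np.exp(-(b + W @ inputs)))

rng = np.random.default_rng(0)
n_words, n_latent = 10, 3                                  # d_1..d_n, h_1..h_m
W_rec = rng.normal(scale=0.1, size=(n_latent, n_words))    # recognition weights
W_gen = rng.normal(scale=0.1, size=(n_words, n_latent))    # generative weights
b_rec, b_gen = np.zeros(n_latent), np.zeros(n_words)

d = rng.integers(0, 2, size=n_words)        # binary word-presence vector
p_h = unit_prob(b_rec, W_rec, d)            # P(h_i = 1 | d), recognition pass
h = (rng.random(n_latent) < p_h).astype(int)
p_d = unit_prob(b_gen, W_gen, h)            # P(d_i = 1 | h), generative pass
```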

Text Mining: Learning Helmholtz Machines

• Introduce a recognition network Q for estimation of the generative
  network:

  log P(D | θ) = Σ_{t=1}^{T} log Σ_h P(d^(t), h | θ)
               = Σ_{t=1}^{T} log Σ_h Q(h | d^(t)) [ P(d^(t), h | θ) / Q(h | d^(t)) ]
               ≥ Σ_{t=1}^{T} Σ_h Q(h | d^(t)) log [ P(d^(t), h | θ) / Q(h | d^(t)) ]

• Wake-Sleep Algorithm
  - Train the recognition and generative models alternately
  - Update the weights iteratively by a simple local delta rule
    (sketched below):

      w_ij^new = w_ij^old + Δw_ij,   Δw_ij = η s_i (s_j - p(s_j = 1))
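A sketch of the local delta rule, assuming a learning rate η that the slide leaves implicit; s_pre and s_post are sampled binary unit states, and p_post is the probability the network being trained assigns to the postsynaptic unit:

```python
# Wake-sleep weight update: W_ji += eta * s_i * (s_j - p(s_j = 1)).
# In the wake phase, recognition samples drive generative-weight updates;
# in the sleep phase, generative samples drive recognition-weight updates.
import numpy as np

def delta_rule(W, b, s_pre, s_post, p_post, eta=0.05):
    """Local delta rule; W has shape (n_post, n_pre)."""
    W += eta * np.outer(s_post - p_post, s_pre)
    b += eta * (s_post - p_post)
    return W, b
```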

Text Mining: Methods

• Text Categorization
  - Train a Helmholtz machine for each category
  - Total of N machines for N categories
  - Once the N machines have been estimated, a test document d is
    classified by estimating its likelihood under each machine
    (see the toy sketch below):

      ĉ = argmax_{c ∈ C} log P(d | c)

• Topic Words Extraction
  - Train a single Helmholtz machine on the entire document set
  - After training, examine the weights of the connections from a latent
    node to the input nodes, i.e., the words
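A toy illustration of likelihood-based categorization; for brevity each "machine" is approximated here by independent per-word Bernoulli probabilities rather than a full Helmholtz machine:

```python
# Pick the category whose model gives the document the highest log P(d | c).
import numpy as np

def log_likelihood(d, p):
    """log P(d | c) under independent Bernoulli word probabilities p."""
    return float(np.sum(d * np.log(p) + (1 - d) * np.log(1 - p)))

machines = {"sports": np.array([0.8, 0.1, 0.1]),
            "politics": np.array([0.1, 0.7, 0.6])}
d = np.array([1, 0, 0])   # binary word-presence vector
c_hat = max(machines, key=lambda c: log_likelihood(d, machines[c]))
print(c_hat)               # -> "sports"
```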

Text Mining: Categorization Results

• Usenet Newsgroup Data
  - 20 categories, 1,000 documents for each category, 20,000 documents
    in total

[Figure: categorization accuracy plot omitted in this transcript.]

Text Mining: Topic Words Extraction Results

• TDT2 Corpus
  - 6,169 documents

Extracted topic word groups (one latent node per row):

1. tobacco, smoking, gingrich, newt, trent, republicans, congressional,
   attorney, smokers, lawsuit, senate, cigarette, morris, nicotine
2. warplane, airline, saudi, gulf, wright, soldiers, yitzhak, tanks,
   stealth, sabah, stations, kurds, mordechai, separatist, governor
3. olympics, nagano, olympic, winter, medal, hockey, athletes, cup, games,
   slalom, medals, bronze, skating, lillehammer, downhill
4. netanyahu, palestinian, arafat, israeli, yasser, kofi, annan, benjamin,
   palestinians, mideast, gaza, jerusalem, eu, paris, israel
5. india, pakistan, pakistani, delhi, hindu, vajpayee, nuclear, tests,
   atal, kashmir, indian, janata, bharatiya, islamabad, bihari
6. suharto, habibie, demonstrators, riots, indonesians, demonstrations,
   soeharto, resignation, jakarta, rioting, electoral, rallies, wiranto,
   unrest, megawati
7. imf, monetary, currencies, currency, rupiah, singapore, bailout,
   traders, markets, thailand, inflation, investors, fund, banks, baht
8. pope, cuba, cuban, embargo, castro, lifting, cubans, havana, alan,
   invasion, reserve, paul, output, vatican, freedom

Web Mining: Customer Analysis

• KDD-2000 Web Mining Competition
  - Data: 465 features over 1,700 customers
    - Features include friend promotion rate, date visited, weight of
      items, price of house, discount rate, etc.
    - Data was collected from Jan. 30 to March 30, 2000
    - The friend promotion started on Feb. 29 with a TV advertisement
  - Aims: description of heavy/low spenders


Web Mining: Feature Selection

• Features selected in various ways [Yang & Zhang, 2000]

Decision Tree:
  - V368 (WeightAverage)
  - V243 (OrderLineQuantitySum)
  - V245 (OrderLineQuantityMaximum)

Decision Tree + Factor Analysis:
  - F1 = 0.94*V324 + 0.868*V374 + 0.898*V412
  - F2 = 0.829*V234 + 0.857*V240
  - F3 = -0.795*V237 + 0.778*V304
  - Component variables: V13 (SendEmail), V234 (OrderItemQuantitySum%
    HavingDiscountRange(5..10)), V237 (OrderItemQuantitySum%
    HavingDiscountRange(10..)), V240 (Friend), V243 (OrderLineQuantitySum),
    V245 (OrderLineQuantityMaximum), V304 (OrderShippingAmtMin),
    V324 (NumLegwearProductViews), V368 (WeightAverage),
    V374 (NumMainTemplateViews), V412 (NumReplenishableStockViews)

Discriminant Model:
  - V240 (Friend), V229 (OrderAverage), V304 (OrderShippingAmtMin),
    V368 (WeightAverage), V43 (HomeMarketValue),
    V377 (NumAccountTemplateViews)
  - plus V11 (WhichDoYouWearMostFrequent), V13 (SendEmail), V17 (USState),
    V45 (VehicleLifeStyle), V68 (RetailActivity), V19 (Date)

Web Mining: Bayesian Nets

• Bayesian network
  - DAG (Directed Acyclic Graph)
  - Expresses dependence relations between variables
  - Can use prior knowledge on the data (parameters)
  - Examples of conjugate priors: Dirichlet for multinomial data,
    Normal-Wishart for normal data

[Diagram: a five-node DAG over A, B, C, D, E with the joint distribution]

  P(A,B,C,D,E) = P(A) P(B|A) P(C|B) P(D|A,B) P(E|B,C,D)
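A sketch of evaluating this factored joint from conditional probability tables; the CPT numbers below are made up for illustration, only the structure comes from the slide:

```python
# Joint probability from the factorization
# P(A,B,C,D,E) = P(A) P(B|A) P(C|B) P(D|A,B) P(E|B,C,D), binary variables.
def bern(p1, x):
    """Return p1 if x == 1, else 1 - p1."""
    return p1 if x == 1 else 1.0 - p1

def joint(a, b, c, d, e):
    p_a = bern(0.3, a)
    p_b = bern({1: 0.9, 0: 0.2}[a], b)                  # P(B | A)
    p_c = bern({1: 0.6, 0: 0.1}[b], c)                  # P(C | B)
    p_d = bern({(1, 1): 0.8, (1, 0): 0.5,
                (0, 1): 0.4, (0, 0): 0.1}[(a, b)], d)   # P(D | A, B)
    p_e = bern(0.2 + 0.2 * (b + c + d), e)              # P(E | B, C, D), toy form
    return p_a * p_b * p_c * p_d * p_e

# Sanity check: the joint sums to 1 over all 32 assignments.
total = sum(joint(a, b, c, d, e)
            for a in (0, 1) for b in (0, 1) for c in (0, 1)
            for d in (0, 1) for e in (0, 1))
print(round(total, 10))   # -> 1.0
```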

Web Mining: Results

• A Bayesian net for the KDD web data
  - V229 (OrderAverage) and V240 (Friend) directly influence V312 (Target)
  - V19 (Date) was influenced by V240 (Friend), reflecting the TV
    advertisement

Summary

• We study machine learning methods, such as
  - Probabilistic neural networks
  - Evolutionary algorithms
  - Reinforcement learning
• Application areas include
  - Text mining
  - Web mining
  - Bioinformatics (not addressed in this talk)
• Recent work focuses on probabilistic graphical models for
  web/text/bio data mining, including
  - Bayesian networks
  - Helmholtz machines
  - Latent variable models

Bayesian Networks: Architecture

[Diagram: a four-node network over variables L, B, G, M.]

  P(L,B,G,M) = P(L) P(B|L) P(G|L,B) P(M|L,B,G)
             = P(L) P(B) P(G|B) P(M|B,L)

• A Bayesian network represents the probabilistic relationships between
  the variables:

      P(X) = ∏_{i=1}^{n} P(X_i | pa_i)

  where pa_i is the set of parent nodes of X_i.

Bayesian Networks: Applications in IR - A Simple BN for Text Classification

[Diagram: class node C with term nodes t_1, t_2, ..., t_8754 as children.]

  C: document class
  t_i: the i-th term

• The network structure represents the naive Bayes assumption
• All nodes are binary
• [Hwang & Zhang, 2000]

Bayesian Networks: Experimental Results

• Dataset
  - The acq dataset from Reuters-21578
  - 8,754 terms were selected by TF-IDF
  - Training data: 8,762 documents
  - Test data: 3,009 documents
• Parametric Learning
  - Dirichlet prior assumptions for the network parameter distributions:

      p(θ_ij | S^h) = Dir(θ_ij | α_ij1, ..., α_ijr_i)

  - Parameter distributions are updated with the training data
    (see the sketch below):

      p(θ_ij | D, S^h) = Dir(θ_ij | α_ij1 + N_ij1, ..., α_ijr_i + N_ijr_i)
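A sketch of the conjugate Dirichlet update, which just adds the observed counts to the prior pseudocounts (toy numbers assumed):

```python
# Dirichlet-multinomial posterior update: Dir(alpha) + counts N -> Dir(alpha + N).
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])       # prior Dir(alpha_1, ..., alpha_r)
counts = np.array([40, 7, 3])           # N_ijk: observed cases per state

posterior = alpha + counts              # conjugacy: posterior is Dirichlet too
theta_hat = posterior / posterior.sum() # posterior mean of the parameters
print(theta_hat)                        # -> approx. [0.77, 0.15, 0.08]
```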

Bayesian Networks: Experimental Results

• For training data
  - Accuracy: 94.28%

                        Recall (%)   Precision (%)
    Positive examples     96.83         75.98
    Negative examples     93.76         99.32

• For test data
  - Accuracy: 96.51%

                        Recall (%)   Precision (%)
    Positive examples     95.16         89.17
    Negative examples     96.88         98.67

Latent Variable Models: Architecture

• Latent variable model for topic words extraction and document
  clustering [Shin & Zhang, 2000]
• Maximize the log-likelihood:

  L = Σ_{n=1}^{N} Σ_{m=1}^{M} n(d_n, w_m) log P(d_n, w_m)
    = Σ_{n=1}^{N} Σ_{m=1}^{M} n(d_n, w_m) log Σ_{k=1}^{K} P(z_k) P(w_m | z_k) P(d_n | z_k)

• Update P(z_k), P(w_m | z_k), and P(d_n | z_k) with the EM algorithm

Latent Variable Models: Learning

• EM (Expectation-Maximization) Algorithm
  - An algorithm to maximize the pre-defined log-likelihood
  - Iterates the E-step and the M-step (sketched in code below)
• E-Step:

  P(z_k | d_n, w_m) = P(z_k) P(d_n | z_k) P(w_m | z_k)
                      / Σ_{k'=1}^{K} P(z_k') P(d_n | z_k') P(w_m | z_k')

• M-Step:

  P(w_m | z_k) = Σ_{n=1}^{N} n(d_n, w_m) P(z_k | d_n, w_m)
                 / Σ_{m'=1}^{M} Σ_{n=1}^{N} n(d_n, w_m') P(z_k | d_n, w_m')

  P(d_n | z_k) = Σ_{m=1}^{M} n(d_n, w_m) P(z_k | d_n, w_m)
                 / Σ_{m=1}^{M} Σ_{n'=1}^{N} n(d_n', w_m) P(z_k | d_n', w_m)

  P(z_k) = (1/R) Σ_{m=1}^{M} Σ_{n=1}^{N} n(d_n, w_m) P(z_k | d_n, w_m),
           where R ≡ Σ_{m=1}^{M} Σ_{n=1}^{N} n(d_n, w_m)
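A compact NumPy sketch of these EM updates (the array shapes, toy sizes, and random initialization are assumptions):

```python
# PLSA-style EM over a document-word count matrix n[d, w], K latent topics.
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 6, 12, 2
n = rng.integers(0, 5, size=(N, M)).astype(float)   # n(d_n, w_m)

p_z = np.full(K, 1.0 / K)                 # P(z_k)
p_w_z = rng.dirichlet(np.ones(M), K)      # P(w_m | z_k), rows sum to 1
p_d_z = rng.dirichlet(np.ones(N), K)      # P(d_n | z_k)

for _ in range(50):
    # E-step: responsibilities P(z_k | d_n, w_m), shape (K, N, M)
    joint = p_z[:, None, None] * p_d_z[:, :, None] * p_w_z[:, None, :]
    post = joint / joint.sum(axis=0, keepdims=True)

    # M-step: re-estimate the three distributions from weighted counts
    weighted = n[None, :, :] * post               # n(d, w) P(z | d, w)
    p_w_z = weighted.sum(axis=1)
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_d_z = weighted.sum(axis=2)
    p_d_z /= p_d_z.sum(axis=1, keepdims=True)
    p_z = weighted.sum(axis=(1, 2)) / n.sum()     # divide by R
```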

Latent Variable Models: Applications in IR - Experimental Results

• Topic words extraction and document clustering with a subset of the
  TREC-8 data
• TREC-8 ad hoc task data
  - Documents: DTDS, FR94, FT, FBIS, LATIMES
  - Topics: 401-450 (401, 434, 439, and 450 were used)
    - 401: Foreign Minorities, Germany
    - 434: Estonia, Economy
    - 439: Inventions, Scientific discovery
    - 450: King Hussein, Peace

Latent Variable Models: Applications in IR - Experimental Results

• Labels are assigned to the cluster z_k with maximum P(d_i | z_k):

  Topic (#Docs)   z2    z4    z3    z1    Precision   Recall
  401 (300)       279   1     0     20    0.902       0.930
  434 (347)       20    238   10    79    0.996       0.686
  439 (219)       7     0     203   9     0.953       0.927
  450 (293)       3     0     0     290   0.729       0.990

• Extracted topic words (top 35 words with highest P(w_j | z_k)):

  Cluster 2 (z2): german, germani, mr, parti, year, foreign, people,
    countri, govern, asylum, polit, nation, law, minist, europ, state,
    immigr, democrat, wing, social, turkish, west, east, member, attack, ...
  Cluster 4 (z4): percent, estonia, bank, state, privat, russian, year,
    enterprise, trade, million, estonian, econom, countri, govern,
    compani, foreign, baltic, polish, loan, invest, fund, product, ...
  Cluster 3 (z3): research, technology, develop, mar, materi, system,
    nuclear, environment, electr, process, product, power, energi,
    control, japan, pollution, structur, chemic, plant, ...
  Cluster 1 (z1): jordan, peac, israel, palestinian, king, isra, arab,
    meet, talk, husayn, agreem, presid, majesti, negoti, minist, visit,
    region, arafat, secur, peopl, east, washington, econom, sign, relat,
    jerusalem, rabin, syria, iraq, ...

Boosting: Algorithms

• A general method of converting rough rules into a highly accurate
  prediction rule
• Learning procedure (sketched in code below)
  - Examine the training set
  - Derive a rough rule (weak learner)
  - Re-weight the examples in the training set, concentrating on the
    hard cases for previous rules
  - Repeat T times

[Diagram: importance weights of training documents feed a sequence of
learners producing hypotheses h_1, h_2, h_3, h_4, which are combined as
f(h_1, h_2, h_3, h_4).]
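A minimal AdaBoost-style sketch of this loop, using decision stumps as the "rough rules" (the stump search and constants are illustrative assumptions):

```python
# AdaBoost: re-weight examples after each weak learner, combine with alphas.
import numpy as np

def adaboost(X, y, T=10):
    """X: (n, d) features; y: labels in {-1, +1}. Returns (stump, alpha) list."""
    n = len(y)
    w = np.full(n, 1.0 / n)                 # importance weights of examples
    ensemble = []
    for _ in range(T):
        best = None
        for j in range(X.shape[1]):         # weak learner: best threshold stump
            for thr in np.unique(X[:, j]):
                for sign in (1, -1):
                    pred = sign * np.where(X[:, j] > thr, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, sign)
        err, j, thr, sign = best
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        pred = sign * np.where(X[:, j] > thr, 1, -1)
        w *= np.exp(-alpha * y * pred)      # concentrate on the hard cases
        w /= w.sum()
        ensemble.append((j, thr, sign, alpha))
    return ensemble

def predict(ensemble, X):
    score = sum(a * s * np.where(X[:, j] > t, 1, -1) for j, t, s, a in ensemble)
    return np.sign(score)
```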

Boosting: Applied to Text Filtering

• Naive Bayes
  - Traditional algorithm for text filtering (a sketch follows below):

  c_NB = argmax_{c_j ∈ {relevant, irrelevant}} P(c_j) P(d_i | c_j)
       = argmax_{c_j} P(c_j) ∏_{k=1}^{n} P(w_ik | c_j)
         (assumes independence among terms)
       = argmax_{c_j} P(c_j) P(w_i1 = "our" | c_j) P(w_i2 = "approach" | c_j)
         ... P(w_in = "trouble" | c_j)

• Boosting naive Bayes
  - Uses naive Bayes classifiers as weak learners
  - [Kim & Zhang, SIGIR-2000]
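A naive Bayes filtering sketch in log space to avoid underflow (the tiny corpus and Laplace smoothing are illustrative assumptions):

```python
# Relevant/irrelevant filtering with a multinomial naive Bayes classifier.
import math
from collections import Counter

train = [("our approach works", "relevant"),
         ("trouble with spam", "irrelevant"),
         ("approach to filtering", "relevant")]

classes = {"relevant", "irrelevant"}
prior = {c: sum(1 for _, y in train if y == c) / len(train) for c in classes}
counts = {c: Counter(w for d, y in train if y == c for w in d.split())
          for c in classes}
vocab = {w for d, _ in train for w in d.split()}

def classify(doc):
    def score(c):
        total = sum(counts[c].values())
        return math.log(prior[c]) + sum(
            math.log((counts[c][w] + 1) / (total + len(vocab)))  # Laplace
            for w in doc.split())
    return max(classes, key=score)

print(classify("our approach"))   # -> "relevant"
```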

Boosting: Applied to Text Filtering - Experimental Results

• TREC (Text Retrieval Conference)
  - Sponsored by NIST
• TREC-7 filtering datasets
  - Training documents: AP articles (1988), 237 MB, 79,919 documents
  - Test documents: AP articles (1989-1990), 471 MB, 162,999 documents
  - No. of topics: 50
• TREC-8 filtering datasets
  - Training documents: Financial Times (1991-1992), 167 MB, 64,139 documents
  - Test documents: Financial Times (1993-1994), 382 MB, 140,651 documents
  - No. of topics: 50

[Figure: example of a document omitted in this transcript.]

Boosting: Applied to Text Filtering - Experimental Results

• Compared with the state-of-the-art text filtering systems:

TREC-7:
                         Boosting   ATT     NTT     PIRC
  Averaged Scaled F1     0.474      0.461   0.452   0.500
  Averaged Scaled F3     0.467      0.460   0.505   0.509

TREC-8:
                         Boosting   PLT1    PLT2    PIRC
  Averaged Scaled LF1    0.717      0.712   0.713   0.714

                         Boosting   CL      PIRC    Mer
  Averaged Scaled LF2    0.722      0.721   0.734   0.720

Evolutionary Learning: Applications in IR - Web-Document Retrieval

• [Kim & Zhang, 2000]

[Diagram: chromosomes encode weight vectors (w_1, w_2, w_3, ..., w_n) over
link information and HTML tag information (<TITLE>, <H>, <B>, <A>); each
chromosome is evaluated by retrieval performance, which serves as its
fitness.]

Evolutionary Learning: Applications in IR - Tag Weighting

• Crossover (sketched in code below)
  - Parents: chromosome X = (x_1, x_2, x_3, ..., x_n) and
    chromosome Y = (y_1, y_2, y_3, ..., y_n)
  - Offspring: chromosome Z with z_i = (x_i + y_i) / 2, applied w.p. P_c
• Mutation
  - Chromosome X = (x_1, x_2, x_3, ..., x_n)
  - Each value is changed w.p. P_m
• Truncation selection
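A sketch of these operators, assuming Gaussian perturbation for mutation since the slide only says the value is changed w.p. P_m:

```python
# GA operators for tag-weight vectors: averaging crossover, mutation,
# and truncation selection.
import numpy as np

rng = np.random.default_rng(0)

def crossover(x, y, p_c=0.7):
    """Averaging crossover: z_i = (x_i + y_i) / 2, applied w.p. p_c per gene."""
    mask = rng.random(len(x)) < p_c
    return np.where(mask, (x + y) / 2.0, x)

def mutate(x, p_m=0.05, scale=0.1):
    """Perturb each gene w.p. p_m (Gaussian noise is an assumption here)."""
    mask = rng.random(len(x)) < p_m
    return np.where(mask, x + rng.normal(scale=scale, size=len(x)), x)

def truncation_selection(pop, fitness, keep=0.5):
    """Keep the top fraction of the population by fitness."""
    order = np.argsort(fitness)[::-1]
    return pop[order[: int(len(pop) * keep)]]
```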

Evolutionary Learning: Applications in IR - Experimental Results

• Datasets
  - TREC-8 Web Track data
  - 2 GB, 247,491 web documents (WT2g)
  - No. of training topics: 10; no. of test topics: 10
• Results

[Figure: retrieval performance results omitted in this transcript.]

Reinforcement Learning: Basic Concept

[Diagram: the agent-environment loop. (1) The agent observes state s_t,
(2) selects action a_t, (3) the environment returns reward r_{t+1}, and
(4) the agent observes the next state s_{t+1}.]
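A minimal agent-environment loop matching the diagram (epsilon-greedy Q-learning on a toy two-state environment; all specifics are illustrative assumptions, not the system from the talk):

```python
# Agent-environment loop: observe state, act, receive reward, update.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 2, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.9, 0.1

def env_step(s, a):
    """Toy dynamics: action 1 in state 1 pays off; otherwise small penalty."""
    reward = 1.0 if (s == 1 and a == 1) else -0.1
    next_s = (s + a) % n_states
    return reward, next_s

s = 0
for t in range(1000):
    a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
    r, s_next = env_step(s, a)            # reward r_{t+1}, next state s_{t+1}
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next
```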

Reinforcement Learning: Applications in IR - Information Filtering

• WAIR [Seo & Zhang, 2000]

[Diagram: WAIR retrieves documents and calculates their similarity to the
user profile (state_i), modifies the profile (action_i), and receives the
user's relevance feedback as reward_{i+1}, moving to state_{i+1}; document
filtering then delivers the filtered documents to the user.]

Reinforcement Learning: Experimental Results (Explicit Feedback)

[Figure: performance (%) plot omitted in this transcript.]

Reinforcement Learning: Experimental Results (Implicit Feedback)

[Figure: performance (%) plot omitted in this transcript.]
