
Machine Learning Methods

for
Text / Web Data Mining
Byoung-Tak Zhang
School of Computer Science and Engineering
Seoul National University
E-mail: btzhang@cse.snu.ac.kr
This material is available at
http://scai.snu.ac.kr/~btzhang/

Overview

• Introduction
  - Web Information Retrieval
  - Machine Learning (ML)
  - ML Methods for Text/Web Data Mining
• Text/Web Data Analysis
  - Text Mining Using Helmholtz Machines
  - Web Mining Using Bayesian Networks
• Summary
  - Current and Future Work

Web Information Retrieval

[Diagram: a text-data pipeline. Text data flows through preprocessing and
indexing into three tasks: text classification (classification system);
information filtering (information filtering system, driven by a user
profile with question/answer feedback, producing filtered data); and
information extraction (DB template filling & information extraction
system, producing DB records with fields such as Location and Date).]

Machine Learning

• Supervised Learning
  - Estimate an unknown mapping from known input-output pairs
  - Learn f_w from a training set D = {(x, y)} s.t. f_w(x) = y = f(x)
  - Classification: y is discrete
  - Regression: y is continuous
• Unsupervised Learning (contrasted with the supervised case in the sketch below)
  - Only input values are provided
  - Learn f_w from D = {(x)} s.t. f_w(x) = x
  - Density estimation
  - Compression, clustering
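A minimal sketch of the two settings in Python (the dataset and model choices are illustrative assumptions, not from the talk):

```python
# Supervised vs. unsupervised learning on the same toy data.
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = make_blobs(n_samples=200, centers=3, random_state=0)

# Supervised: learn f_w from (x, y) pairs so that f_w(x) approximates y.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("supervised accuracy:", clf.score(X, y))

# Unsupervised: only inputs are given; clustering recovers structure in x.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster assignments:", km.labels_[:10])
```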

Machine Learning Methods

• Neural Networks
  - Multilayer Perceptrons (MLPs)
  - Self-Organizing Maps (SOMs)
  - Support Vector Machines (SVMs)
• Probabilistic Models
  - Bayesian Networks (BNs)
  - Helmholtz Machines (HMs)
  - Latent Variable Models (LVMs)
• Other Machine Learning Methods
  - Evolutionary Algorithms (EAs)
  - Reinforcement Learning (RL)
  - Boosting Algorithms
  - Decision Trees (DTs)

ML for Text/Web Data Mining

• Bayesian Networks for Text Classification
• Helmholtz Machines for Text Clustering/Categorization
• Latent Variable Models for Topic Word Extraction
• Boosted Learning for the TREC Filtering Task
• Evolutionary Learning for Web Document Retrieval
• Reinforcement Learning for Web Filtering Agents
• Bayesian Networks for Web Customer Data Mining

Preprocessing for Text Learning

Example: a Usenet message is converted into a word-count vector.

  From: xxx@sciences.sdsu.edu
  Newsgroups: comp.graphics
  Subject: Need specs on Apple QT

  I need the specs, or at least a very verbose interpretation of the
  specs, for QuickTime. Technical articles from magazines and
  references to books would be nice, too.
  I also need the specs in a format usable on a Unix or MS-DOS
  system. I can't do much with the QuickTime stuff they have on...

  → baseball: 0, car: 0, clinton: 0, computer: 0, graphics: 0,
    hockey: 0, quicktime: 2, ..., references: 1, space: 0, specs: 3,
    unix: 1, ...
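A bag-of-words preprocessing sketch (the message text and vocabulary below are stand-ins for the slide's example):

```python
# Convert a document into a word-count vector over a fixed vocabulary.
from sklearn.feature_extraction.text import CountVectorizer

doc = ("I need the specs, or at least a very verbose interpretation of "
       "the specs, for QuickTime. References to books would be nice. "
       "I also need the specs in a format usable on a Unix system.")

vocab = ["baseball", "car", "clinton", "computer", "graphics",
         "hockey", "quicktime", "references", "space", "specs", "unix"]
vectorizer = CountVectorizer(vocabulary=vocab)
counts = vectorizer.transform([doc]).toarray()[0]
for word, count in zip(vocab, counts):
    print(f"{count} {word}")   # e.g., "3 specs", "1 unix"
```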

Text Mining: Data Sets

• Usenet Newsgroup Data
  - 20 categories
  - 1,000 documents for each category
  - 20,000 documents in total
• TDT2 Corpus
  - Topic Detection and Tracking (TDT): NIST
  - Used 6,169 documents in experiments

Text Mining: Helmholtz Machine Architecture

[Diagram: latent nodes h_1, h_2, ..., h_m fully connected to input nodes
d_1, d_2, ..., d_n; recognition weights run from the inputs to the latent
nodes, generative weights from the latent nodes back to the inputs.]

  P(h_i = 1) = 1 / (1 + exp(-b_i - Σ_{j=1}^{n} w_ij d_j))   (recognition)
  P(d_i = 1) = 1 / (1 + exp(-b_i - Σ_{j=1}^{m} w_ij h_j))   (generative)

• Latent nodes
  - Binary values
  - Extract the underlying causal structure in the document set
  - Capture correlations of the words in documents
• Input nodes
  - Binary values
  - Represent the presence or absence of words in documents

[Chang and Zhang, 2000]
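A sketch of the stochastic binary units above, assuming the sigmoid parameterization shown (toy sizes and random weights):

```python
# One recognition pass and one generative pass through a tiny Helmholtz machine.
import numpy as np

def unit_prob(b, W, inputs):
    """P(unit_i = 1) = sigmoid(b_i + sum_j W[i, j] * inputs[j])."""
    return 1.0 / (1.0 + np.exp(-(b + W @ inputs)))

rng = np.random.default_rng(0)
n_words, n_latent = 10, 3                                  # d_1..d_n, h_1..h_m
W_rec = rng.normal(scale=0.1, size=(n_latent, n_words))    # recognition weights
W_gen = rng.normal(scale=0.1, size=(n_words, n_latent))    # generative weights
b_rec, b_gen = np.zeros(n_latent), np.zeros(n_words)

d = rng.integers(0, 2, size=n_words)        # binary word-presence vector
p_h = unit_prob(b_rec, W_rec, d)            # P(h_i = 1 | d), recognition pass
h = (rng.random(n_latent) < p_h).astype(int)
p_d = unit_prob(b_gen, W_gen, h)            # P(d_i = 1 | h), generative pass
```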

Text Mining: Learning Helmholtz Machines

• Introduce a recognition network Q for estimation of the generative
  network:

  log P(D | θ) = Σ_{t=1}^{T} log Σ_h P(d^(t), h | θ)
               = Σ_{t=1}^{T} log Σ_h Q(h | d^(t)) [ P(d^(t), h | θ) / Q(h | d^(t)) ]
               ≥ Σ_{t=1}^{T} Σ_h Q(h | d^(t)) log [ P(d^(t), h | θ) / Q(h | d^(t)) ]

• Wake-Sleep Algorithm
  - Train the recognition and generative models alternately
  - Update the weights iteratively by a simple local delta rule
    (sketched below):

      w_ij^new = w_ij^old + Δw_ij,   Δw_ij = η s_i (s_j - p(s_j = 1))
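A sketch of the local delta rule, assuming a learning rate η that the slide leaves implicit; s_pre and s_post are sampled binary unit states, and p_post is the probability the network being trained assigns to the postsynaptic unit:

```python
# Wake-sleep weight update: W_ji += eta * s_i * (s_j - p(s_j = 1)).
# In the wake phase, recognition samples drive generative-weight updates;
# in the sleep phase, generative samples drive recognition-weight updates.
import numpy as np

def delta_rule(W, b, s_pre, s_post, p_post, eta=0.05):
    """Local delta rule; W has shape (n_post, n_pre)."""
    W += eta * np.outer(s_post - p_post, s_pre)
    b += eta * (s_post - p_post)
    return W, b
```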

Text Mining: Methods

• Text Categorization
  - Train a Helmholtz machine for each category
  - Total of N machines for N categories
  - Once the N machines have been estimated, a test document d is
    classified by estimating its likelihood under each machine
    (see the toy sketch below):

      ĉ = argmax_{c ∈ C} log P(d | c)

• Topic Words Extraction
  - Train a single Helmholtz machine on the entire document set
  - After training, examine the weights of the connections from a latent
    node to the input nodes, i.e., the words
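A toy illustration of likelihood-based categorization; for brevity each "machine" is approximated here by independent per-word Bernoulli probabilities rather than a full Helmholtz machine:

```python
# Pick the category whose model gives the document the highest log P(d | c).
import numpy as np

def log_likelihood(d, p):
    """log P(d | c) under independent Bernoulli word probabilities p."""
    return float(np.sum(d * np.log(p) + (1 - d) * np.log(1 - p)))

machines = {"sports": np.array([0.8, 0.1, 0.1]),
            "politics": np.array([0.1, 0.7, 0.6])}
d = np.array([1, 0, 0])   # binary word-presence vector
c_hat = max(machines, key=lambda c: log_likelihood(d, machines[c]))
print(c_hat)               # -> "sports"
```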

Text Mining: Categorization Results

• Usenet Newsgroup Data
  - 20 categories, 1,000 documents for each category, 20,000 documents
    in total

[Figure: categorization accuracy plot omitted in this transcript.]

Text Mining: Topic Words Extraction Results

• TDT2 Corpus
  - 6,169 documents

Extracted topic word groups (one latent node per row):

1. tobacco, smoking, gingrich, newt, trent, republicans, congressional,
   attorney, smokers, lawsuit, senate, cigarette, morris, nicotine
2. warplane, airline, saudi, gulf, wright, soldiers, yitzhak, tanks,
   stealth, sabah, stations, kurds, mordechai, separatist, governor
3. olympics, nagano, olympic, winter, medal, hockey, athletes, cup, games,
   slalom, medals, bronze, skating, lillehammer, downhill
4. netanyahu, palestinian, arafat, israeli, yasser, kofi, annan, benjamin,
   palestinians, mideast, gaza, jerusalem, eu, paris, israel
5. india, pakistan, pakistani, delhi, hindu, vajpayee, nuclear, tests,
   atal, kashmir, indian, janata, bharatiya, islamabad, bihari
6. suharto, habibie, demonstrators, riots, indonesians, demonstrations,
   soeharto, resignation, jakarta, rioting, electoral, rallies, wiranto,
   unrest, megawati
7. imf, monetary, currencies, currency, rupiah, singapore, bailout,
   traders, markets, thailand, inflation, investors, fund, banks, baht
8. pope, cuba, cuban, embargo, castro, lifting, cubans, havana, alan,
   invasion, reserve, paul, output, vatican, freedom

Web Mining: Customer Analysis

• KDD-2000 Web Mining Competition
  - Data: 465 features over 1,700 customers
    - Features include friend promotion rate, date visited, weight of
      items, price of house, discount rate, etc.
    - Data was collected from Jan. 30 to March 30, 2000
    - The friend promotion started on Feb. 29 with a TV advertisement
  - Aims: description of heavy/low spenders


Web Mining: Feature Selection

• Features selected in various ways [Yang & Zhang, 2000]

Decision Tree:
  - V368 (WeightAverage)
  - V243 (OrderLineQuantitySum)
  - V245 (OrderLineQuantityMaximum)

Decision Tree + Factor Analysis:
  - F1 = 0.94*V324 + 0.868*V374 + 0.898*V412
  - F2 = 0.829*V234 + 0.857*V240
  - F3 = -0.795*V237 + 0.778*V304
  - Component variables: V13 (SendEmail), V234 (OrderItemQuantitySum%
    HavingDiscountRange(5..10)), V237 (OrderItemQuantitySum%
    HavingDiscountRange(10..)), V240 (Friend), V243 (OrderLineQuantitySum),
    V245 (OrderLineQuantityMaximum), V304 (OrderShippingAmtMin),
    V324 (NumLegwearProductViews), V368 (WeightAverage),
    V374 (NumMainTemplateViews), V412 (NumReplenishableStockViews)

Discriminant Model:
  - V240 (Friend), V229 (OrderAverage), V304 (OrderShippingAmtMin),
    V368 (WeightAverage), V43 (HomeMarketValue),
    V377 (NumAccountTemplateViews)
  - plus V11 (WhichDoYouWearMostFrequent), V13 (SendEmail), V17 (USState),
    V45 (VehicleLifeStyle), V68 (RetailActivity), V19 (Date)

Web Mining: Bayesian Nets

• Bayesian network
  - DAG (Directed Acyclic Graph)
  - Expresses dependence relations between variables
  - Can use prior knowledge on the data (parameters)
  - Examples of conjugate priors: Dirichlet for multinomial data,
    Normal-Wishart for normal data

[Diagram: a five-node DAG over A, B, C, D, E with the joint distribution]

  P(A,B,C,D,E) = P(A) P(B|A) P(C|B) P(D|A,B) P(E|B,C,D)
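A sketch of evaluating this factored joint from conditional probability tables; the CPT numbers below are made up for illustration, only the structure comes from the slide:

```python
# Joint probability from the factorization
# P(A,B,C,D,E) = P(A) P(B|A) P(C|B) P(D|A,B) P(E|B,C,D), binary variables.
def bern(p1, x):
    """Return p1 if x == 1, else 1 - p1."""
    return p1 if x == 1 else 1.0 - p1

def joint(a, b, c, d, e):
    p_a = bern(0.3, a)
    p_b = bern({1: 0.9, 0: 0.2}[a], b)                  # P(B | A)
    p_c = bern({1: 0.6, 0: 0.1}[b], c)                  # P(C | B)
    p_d = bern({(1, 1): 0.8, (1, 0): 0.5,
                (0, 1): 0.4, (0, 0): 0.1}[(a, b)], d)   # P(D | A, B)
    p_e = bern(0.2 + 0.2 * (b + c + d), e)              # P(E | B, C, D), toy form
    return p_a * p_b * p_c * p_d * p_e

# Sanity check: the joint sums to 1 over all 32 assignments.
total = sum(joint(a, b, c, d, e)
            for a in (0, 1) for b in (0, 1) for c in (0, 1)
            for d in (0, 1) for e in (0, 1))
print(round(total, 10))   # -> 1.0
```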

Web Mining: Results

• A Bayesian net for the KDD web data
  - V229 (OrderAverage) and V240 (Friend) directly influence V312 (Target)
  - V19 (Date) was influenced by V240 (Friend), reflecting the TV
    advertisement

Summary

• We study machine learning methods, such as
  - Probabilistic neural networks
  - Evolutionary algorithms
  - Reinforcement learning
• Application areas include
  - Text mining
  - Web mining
  - Bioinformatics (not addressed in this talk)
• Recent work focuses on probabilistic graphical models for
  web/text/bio data mining, including
  - Bayesian networks
  - Helmholtz machines
  - Latent variable models

Bayesian Networks: Architecture

[Diagram: a four-node network over variables L, B, G, M.]

  P(L,B,G,M) = P(L) P(B|L) P(G|L,B) P(M|L,B,G)
             = P(L) P(B) P(G|B) P(M|B,L)

• A Bayesian network represents the probabilistic relationships between
  the variables:

      P(X) = ∏_{i=1}^{n} P(X_i | pa_i)

  where pa_i is the set of parent nodes of X_i.

Bayesian Networks: Applications in IR - A Simple BN for Text Classification

[Diagram: class node C with term nodes t_1, t_2, ..., t_8754 as children.]

  C: document class
  t_i: the i-th term

• The network structure represents the naive Bayes assumption
• All nodes are binary
• [Hwang & Zhang, 2000]

Bayesian Networks: Experimental Results

• Dataset
  - The acq dataset from Reuters-21578
  - 8,754 terms were selected by TF-IDF
  - Training data: 8,762 documents
  - Test data: 3,009 documents
• Parametric Learning
  - Dirichlet prior assumptions for the network parameter distributions:

      p(θ_ij | S^h) = Dir(θ_ij | α_ij1, ..., α_ijr_i)

  - Parameter distributions are updated with the training data
    (see the sketch below):

      p(θ_ij | D, S^h) = Dir(θ_ij | α_ij1 + N_ij1, ..., α_ijr_i + N_ijr_i)
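A sketch of the conjugate Dirichlet update, which just adds the observed counts to the prior pseudocounts (toy numbers assumed):

```python
# Dirichlet-multinomial posterior update: Dir(alpha) + counts N -> Dir(alpha + N).
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])       # prior Dir(alpha_1, ..., alpha_r)
counts = np.array([40, 7, 3])           # N_ijk: observed cases per state

posterior = alpha + counts              # conjugacy: posterior is Dirichlet too
theta_hat = posterior / posterior.sum() # posterior mean of the parameters
print(theta_hat)                        # -> approx. [0.77, 0.15, 0.08]
```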

Bayesian Networks: Experimental Results

• For training data
  - Accuracy: 94.28%

                        Recall (%)   Precision (%)
    Positive examples     96.83         75.98
    Negative examples     93.76         99.32

• For test data
  - Accuracy: 96.51%

                        Recall (%)   Precision (%)
    Positive examples     95.16         89.17
    Negative examples     96.88         98.67

Latent Variable Models: Architecture

• Latent variable model for topic words extraction and document
  clustering [Shin & Zhang, 2000]
• Maximize the log-likelihood:

  L = Σ_{n=1}^{N} Σ_{m=1}^{M} n(d_n, w_m) log P(d_n, w_m)
    = Σ_{n=1}^{N} Σ_{m=1}^{M} n(d_n, w_m) log Σ_{k=1}^{K} P(z_k) P(w_m | z_k) P(d_n | z_k)

• Update P(z_k), P(w_m | z_k), and P(d_n | z_k) with the EM algorithm

Latent Variable Models: Learning

• EM (Expectation-Maximization) Algorithm
  - An algorithm to maximize the pre-defined log-likelihood
  - Iterates the E-step and the M-step (sketched in code below)
• E-Step:

  P(z_k | d_n, w_m) = P(z_k) P(d_n | z_k) P(w_m | z_k)
                      / Σ_{k'=1}^{K} P(z_k') P(d_n | z_k') P(w_m | z_k')

• M-Step:

  P(w_m | z_k) = Σ_{n=1}^{N} n(d_n, w_m) P(z_k | d_n, w_m)
                 / Σ_{m'=1}^{M} Σ_{n=1}^{N} n(d_n, w_m') P(z_k | d_n, w_m')

  P(d_n | z_k) = Σ_{m=1}^{M} n(d_n, w_m) P(z_k | d_n, w_m)
                 / Σ_{m=1}^{M} Σ_{n'=1}^{N} n(d_n', w_m) P(z_k | d_n', w_m)

  P(z_k) = (1/R) Σ_{m=1}^{M} Σ_{n=1}^{N} n(d_n, w_m) P(z_k | d_n, w_m),
           where R ≡ Σ_{m=1}^{M} Σ_{n=1}^{N} n(d_n, w_m)
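A compact NumPy sketch of these EM updates (the array shapes, toy sizes, and random initialization are assumptions):

```python
# PLSA-style EM over a document-word count matrix n[d, w], K latent topics.
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 6, 12, 2
n = rng.integers(0, 5, size=(N, M)).astype(float)   # n(d_n, w_m)

p_z = np.full(K, 1.0 / K)                 # P(z_k)
p_w_z = rng.dirichlet(np.ones(M), K)      # P(w_m | z_k), rows sum to 1
p_d_z = rng.dirichlet(np.ones(N), K)      # P(d_n | z_k)

for _ in range(50):
    # E-step: responsibilities P(z_k | d_n, w_m), shape (K, N, M)
    joint = p_z[:, None, None] * p_d_z[:, :, None] * p_w_z[:, None, :]
    post = joint / joint.sum(axis=0, keepdims=True)

    # M-step: re-estimate the three distributions from weighted counts
    weighted = n[None, :, :] * post               # n(d, w) P(z | d, w)
    p_w_z = weighted.sum(axis=1)
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_d_z = weighted.sum(axis=2)
    p_d_z /= p_d_z.sum(axis=1, keepdims=True)
    p_z = weighted.sum(axis=(1, 2)) / n.sum()     # divide by R
```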

Latent Variable Models: Applications in IR - Experimental Results

• Topic words extraction and document clustering with a subset of the
  TREC-8 data
• TREC-8 ad hoc task data
  - Documents: DTDS, FR94, FT, FBIS, LATIMES
  - Topics: 401-450 (401, 434, 439, and 450 were used)
    - 401: Foreign Minorities, Germany
    - 434: Estonia, Economy
    - 439: Inventions, Scientific discovery
    - 450: King Hussein, Peace

Latent Variable Models: Applications in IR - Experimental Results

• Labels are assigned to the cluster z_k with maximum P(d_i | z_k):

  Topic (#Docs)   z2    z4    z3    z1    Precision   Recall
  401 (300)       279   1     0     20    0.902       0.930
  434 (347)       20    238   10    79    0.996       0.686
  439 (219)       7     0     203   9     0.953       0.927
  450 (293)       3     0     0     290   0.729       0.990

• Extracted topic words (top 35 words with highest P(w_j | z_k)):

  Cluster 2 (z2): german, germani, mr, parti, year, foreign, people,
    countri, govern, asylum, polit, nation, law, minist, europ, state,
    immigr, democrat, wing, social, turkish, west, east, member, attack, ...
  Cluster 4 (z4): percent, estonia, bank, state, privat, russian, year,
    enterprise, trade, million, estonian, econom, countri, govern,
    compani, foreign, baltic, polish, loan, invest, fund, product, ...
  Cluster 3 (z3): research, technology, develop, mar, materi, system,
    nuclear, environment, electr, process, product, power, energi,
    control, japan, pollution, structur, chemic, plant, ...
  Cluster 1 (z1): jordan, peac, israel, palestinian, king, isra, arab,
    meet, talk, husayn, agreem, presid, majesti, negoti, minist, visit,
    region, arafat, secur, peopl, east, washington, econom, sign, relat,
    jerusalem, rabin, syria, iraq, ...

Boosting: Algorithms

• A general method of converting rough rules into a highly accurate
  prediction rule
• Learning procedure (sketched in code below)
  - Examine the training set
  - Derive a rough rule (weak learner)
  - Re-weight the examples in the training set, concentrating on the
    hard cases for previous rules
  - Repeat T times

[Diagram: importance weights of training documents feed a sequence of
learners producing hypotheses h_1, h_2, h_3, h_4, which are combined as
f(h_1, h_2, h_3, h_4).]
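A minimal AdaBoost-style sketch of this loop, using decision stumps as the "rough rules" (the stump search and constants are illustrative assumptions):

```python
# AdaBoost: re-weight examples after each weak learner, combine with alphas.
import numpy as np

def adaboost(X, y, T=10):
    """X: (n, d) features; y: labels in {-1, +1}. Returns (stump, alpha) list."""
    n = len(y)
    w = np.full(n, 1.0 / n)                 # importance weights of examples
    ensemble = []
    for _ in range(T):
        best = None
        for j in range(X.shape[1]):         # weak learner: best threshold stump
            for thr in np.unique(X[:, j]):
                for sign in (1, -1):
                    pred = sign * np.where(X[:, j] > thr, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, sign)
        err, j, thr, sign = best
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        pred = sign * np.where(X[:, j] > thr, 1, -1)
        w *= np.exp(-alpha * y * pred)      # concentrate on the hard cases
        w /= w.sum()
        ensemble.append((j, thr, sign, alpha))
    return ensemble

def predict(ensemble, X):
    score = sum(a * s * np.where(X[:, j] > t, 1, -1) for j, t, s, a in ensemble)
    return np.sign(score)
```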

Boosting: Applied to Text Filtering

• Naive Bayes
  - Traditional algorithm for text filtering (a sketch follows below):

  c_NB = argmax_{c_j ∈ {relevant, irrelevant}} P(c_j) P(d_i | c_j)
       = argmax_{c_j} P(c_j) ∏_{k=1}^{n} P(w_ik | c_j)
         (assumes independence among terms)
       = argmax_{c_j} P(c_j) P(w_i1 = "our" | c_j) P(w_i2 = "approach" | c_j)
         ... P(w_in = "trouble" | c_j)

• Boosting naive Bayes
  - Uses naive Bayes classifiers as weak learners
  - [Kim & Zhang, SIGIR-2000]
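A naive Bayes filtering sketch in log space to avoid underflow (the tiny corpus and Laplace smoothing are illustrative assumptions):

```python
# Relevant/irrelevant filtering with a multinomial naive Bayes classifier.
import math
from collections import Counter

train = [("our approach works", "relevant"),
         ("trouble with spam", "irrelevant"),
         ("approach to filtering", "relevant")]

classes = {"relevant", "irrelevant"}
prior = {c: sum(1 for _, y in train if y == c) / len(train) for c in classes}
counts = {c: Counter(w for d, y in train if y == c for w in d.split())
          for c in classes}
vocab = {w for d, _ in train for w in d.split()}

def classify(doc):
    def score(c):
        total = sum(counts[c].values())
        return math.log(prior[c]) + sum(
            math.log((counts[c][w] + 1) / (total + len(vocab)))  # Laplace
            for w in doc.split())
    return max(classes, key=score)

print(classify("our approach"))   # -> "relevant"
```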

Boosting: Applied to Text Filtering - Experimental Results

• TREC (Text Retrieval Conference)
  - Sponsored by NIST
• TREC-7 filtering datasets
  - Training documents: AP articles (1988), 237 MB, 79,919 documents
  - Test documents: AP articles (1989-1990), 471 MB, 162,999 documents
  - No. of topics: 50
• TREC-8 filtering datasets
  - Training documents: Financial Times (1991-1992), 167 MB, 64,139 documents
  - Test documents: Financial Times (1993-1994), 382 MB, 140,651 documents
  - No. of topics: 50

[Figure: example of a document omitted in this transcript.]

Boosting: Applied to Text Filtering - Experimental Results

• Compared with the state-of-the-art text filtering systems:

TREC-7:
                         Boosting   ATT     NTT     PIRC
  Averaged Scaled F1     0.474      0.461   0.452   0.500
  Averaged Scaled F3     0.467      0.460   0.505   0.509

TREC-8:
                         Boosting   PLT1    PLT2    PIRC
  Averaged Scaled LF1    0.717      0.712   0.713   0.714

                         Boosting   CL      PIRC    Mer
  Averaged Scaled LF2    0.722      0.721   0.734   0.720

Evolutionary Learning: Applications in IR - Web-Document Retrieval

• [Kim & Zhang, 2000]

[Diagram: chromosomes encode weight vectors (w_1, w_2, w_3, ..., w_n) over
link information and HTML tag information (<TITLE>, <H>, <B>, <A>); each
chromosome is evaluated by retrieval performance, which serves as its
fitness.]

Evolutionary Learning: Applications in IR - Tag Weighting

• Crossover (sketched in code below)
  - Parents: chromosome X = (x_1, x_2, x_3, ..., x_n) and
    chromosome Y = (y_1, y_2, y_3, ..., y_n)
  - Offspring: chromosome Z with z_i = (x_i + y_i) / 2, applied w.p. P_c
• Mutation
  - Chromosome X = (x_1, x_2, x_3, ..., x_n)
  - Each value is changed w.p. P_m
• Truncation selection
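A sketch of these operators, assuming Gaussian perturbation for mutation since the slide only says the value is changed w.p. P_m:

```python
# GA operators for tag-weight vectors: averaging crossover, mutation,
# and truncation selection.
import numpy as np

rng = np.random.default_rng(0)

def crossover(x, y, p_c=0.7):
    """Averaging crossover: z_i = (x_i + y_i) / 2, applied w.p. p_c per gene."""
    mask = rng.random(len(x)) < p_c
    return np.where(mask, (x + y) / 2.0, x)

def mutate(x, p_m=0.05, scale=0.1):
    """Perturb each gene w.p. p_m (Gaussian noise is an assumption here)."""
    mask = rng.random(len(x)) < p_m
    return np.where(mask, x + rng.normal(scale=scale, size=len(x)), x)

def truncation_selection(pop, fitness, keep=0.5):
    """Keep the top fraction of the population by fitness."""
    order = np.argsort(fitness)[::-1]
    return pop[order[: int(len(pop) * keep)]]
```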

Evolutionary Learning: Applications in IR - Experimental Results

• Datasets
  - TREC-8 Web Track data
  - 2 GB, 247,491 web documents (WT2g)
  - No. of training topics: 10; no. of test topics: 10
• Results

[Figure: retrieval performance results omitted in this transcript.]

Reinforcement Learning: Basic Concept

[Diagram: the agent-environment loop. (1) The agent observes state s_t,
(2) selects action a_t, (3) the environment returns reward r_{t+1}, and
(4) the agent observes the next state s_{t+1}.]
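A minimal agent-environment loop matching the diagram (epsilon-greedy Q-learning on a toy two-state environment; all specifics are illustrative assumptions, not the system from the talk):

```python
# Agent-environment loop: observe state, act, receive reward, update.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 2, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.9, 0.1

def env_step(s, a):
    """Toy dynamics: action 1 in state 1 pays off; otherwise small penalty."""
    reward = 1.0 if (s == 1 and a == 1) else -0.1
    next_s = (s + a) % n_states
    return reward, next_s

s = 0
for t in range(1000):
    a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
    r, s_next = env_step(s, a)            # reward r_{t+1}, next state s_{t+1}
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next
```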

Reinforcement Learning: Applications in IR - Information Filtering

• WAIR [Seo & Zhang, 2000]

[Diagram: WAIR retrieves documents and calculates their similarity to the
user profile (state_i), modifies the profile (action_i), and receives the
user's relevance feedback as reward_{i+1}, moving to state_{i+1}; document
filtering then delivers the filtered documents to the user.]

Reinforcement Learning: Experimental Results (Explicit Feedback)

[Figure: performance (%) plot omitted in this transcript.]

Reinforcement Learning: Experimental Results (Implicit Feedback)

[Figure: performance (%) plot omitted in this transcript.]
