
A Web Page Topic Segmentation Algorithm Based on Visual Criteria and Content Presentation


Idir Chibane, Bich-Liên Doan
SUPELEC, École Supérieure d’Électricité
Plateau de Moulon, 3 rue Joliot Curie, 91192, Gif-sur-Yvette, France
{Idir.Chibane, Bich-Lien.Doan}@supelec.fr

Context : Page segmentation based on HTML tags for Information Retrieval.


Problem at page level
Multiple topics and varying page length are two factors that significantly decrease the performance of web search.
The content of a web page contains multiple regions with unrelated topics (e.g. navigation bars).

Problem of existing page segmentation algorithms based on HTML tags
It is difficult to obtain appropriate page segments from specific HTML tags because:
HTML tags are not used consistently as segment delimiters; for example, tags such as <TR>, <BR> and <P> are used for layout structuring and content presentation.
Page authors use different HTML tags to separate possible segments.
The segment delimiters are arbitrary and do not follow specific rules.

Fig 1. Example of two segmentation solutions for the same web page using different segment delimiters (paragraph <P> and vertical line)

Hypothesis
A web page with a region of high density of matched terms is more relevant than a web page with matched terms distributed across the entire page.
The visual and layout structure can help the user to unconsciously divide the web page into several semantic parts.
The textual content of a page follows the sequential organization of the page topic.
The possible sequences of blocks in a web page follow coherence criteria.
Topic analysis, which indicates how topics change within the text, is based on boundary delimitation. In the case of the World Wide Web, the HTML structure of a page makes it possible to divide the page into a set of blocks by using HTML tags.

Our solution: segment a web page by evaluating several candidate segmentations with a topic analysis method.

Model
Block coherence measure: a co-occurrence measure applied inside a block to estimate its coherence. The coherence of a block's content reflects the density of the information linked to a topic and the degree of correlation between the terms of the block.
High co-occurrence between block terms => high coherence inside the block

\[
\mathrm{Coh}(b) = \frac{1}{nt(b)^2} \sum_{t_i \in b} \sum_{t_j \in b} \mathrm{Cooccurrence}(t_i, t_j)
\]
with
\[
\mathrm{Cooccurrence}(t_i, t_j) = \frac{\mathrm{Nbdoc}(t_i, t_j)}{\mathrm{Nbdoc}(t_i) + \mathrm{Nbdoc}(t_j) - \mathrm{Nbdoc}(t_i, t_j)}
\]
Nbdoc(t1,..,tn) : the number of documents containing all the terms t1,..,tn.
nt(b) : the number of terms of block b.
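
The coherence measure can be illustrated with a minimal Python sketch. It assumes the collection is available as a list of term sets (`docs`) and that nt(b) counts the distinct terms of the block; both are assumptions of this example, as are the function names. Following the formula as written, the double sum includes the pairs where both terms are the same.

```python
def cooccurrence(ti, tj, docs):
    """Documents containing both terms divided by documents containing either.

    `docs` is a list of sets of terms (an assumed representation).
    """
    n_i = sum(1 for d in docs if ti in d)
    n_j = sum(1 for d in docs if tj in d)
    n_ij = sum(1 for d in docs if ti in d and tj in d)
    denom = n_i + n_j - n_ij
    # Neither term occurs anywhere: treat the co-occurrence as 0 (assumption).
    return n_ij / denom if denom else 0.0


def coherence(block_terms, docs):
    """Coh(b): average co-occurrence over all ordered term pairs of the block."""
    terms = sorted(set(block_terms))  # nt(b) taken as distinct terms (assumption)
    if not terms:
        return 0.0
    total = sum(cooccurrence(ti, tj, docs) for ti in terms for tj in terms)
    return total / len(terms) ** 2


if __name__ == "__main__":
    docs = [{"a", "b"}, {"a"}, {"b", "c"}]
    print(coherence(["a", "b"], docs))
```

Counting document frequencies by scanning `docs` is quadratic in the block size and linear in the collection size; a real system would presumably read these counts from an inverted index.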

Distance measure between adjacent blocks: based on the cosine measure, which computes a similarity value between two block vectors. The inverse of the cosine measure can be seen as a distance between two blocks.
High similarity between two adjacent blocks => small distance between them.

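\[
\mathrm{Dist}(b_i, b_j) = \frac{1}{\mathrm{Sim}(V_{b_i}, V_{b_j})} = \frac{1}{\cos(V_{b_i}, V_{b_j})} = \frac{\sqrt{\sum_{k=1}^{n} w_{k,b_i}^2 \times \sum_{k=1}^{n} w_{k,b_j}^2}}{\sum_{k=1}^{n} w_{k,b_i} \times w_{k,b_j}}
\]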

bi, bj are blocks, and Vbi, Vbj are the block vectors of bi and bj respectively.
The weight wk,b of each term k in a block vector is calculated using the Okapi BM25 measure.
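
A minimal Python sketch of the distance follows. It assumes each block vector is a dictionary mapping a term to a precomputed, non-negative weight; the weighting step itself is not shown. Returning infinity for blocks that share no terms is a choice made for this example, since the inverse of a zero cosine is undefined.

```python
import math


def distance(vbi, vbj):
    """Inverse cosine similarity between two sparse block vectors.

    `vbi` and `vbj` map term -> weight (weights assumed precomputed).
    """
    dot = sum(w * vbj.get(term, 0.0) for term, w in vbi.items())
    norm_i = math.sqrt(sum(w * w for w in vbi.values()))
    norm_j = math.sqrt(sum(w * w for w in vbj.values()))
    if dot == 0:
        # No shared terms: cosine is 0, so the distance is unbounded (assumption).
        return math.inf
    return (norm_i * norm_j) / dot


if __name__ == "__main__":
    print(distance({"web": 1.0, "page": 1.0}, {"web": 1.0}))
```

With non-negative weights the cosine lies in (0, 1], so the distance is at least 1 and grows as adjacent blocks become less similar.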

Segmentation evaluation function:

\[
\mathrm{SegmEvalFunct}(S_i, P) = \left( \frac{\sum_{1 \le i \le nb(P)} \mathrm{Coh}(b_i)}{nb(P)} \right) \times \left( \frac{\sum_{1 \le i \le nb(P)-1} \mathrm{Dist}(b_i, b_{i+1})}{nb(P) - 1} \right)
\]
Si: a candidate segmentation of the page P based on visual and format HTML tags.
nb(P): the number of blocks extracted from P according to the segmentation Si.
The segmentation with the highest value of the evaluation function is selected.

Architecture: Fig 2. Topic segmentation process based on visual and format HTML tags
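
The evaluation function and the selection step can be sketched in Python as below. The sketch assumes the per-block coherence values and the distances between adjacent blocks have already been computed for each candidate segmentation; the names `segm_eval` and `select_segmentation` are hypothetical. The poster does not say how a single-block page is scored, so this sketch rejects that case rather than guessing.

```python
def segm_eval(coherences, distances):
    """Mean block coherence times mean distance between adjacent blocks."""
    nb = len(coherences)
    if nb < 2 or len(distances) != nb - 1:
        # The formula divides by nb(P) - 1, so at least two blocks are needed.
        raise ValueError("need nb >= 2 blocks and nb - 1 adjacent distances")
    return (sum(coherences) / nb) * (sum(distances) / (nb - 1))


def select_segmentation(candidates):
    """Return the name of the candidate with the highest evaluation value.

    `candidates` maps a delimiter name to a (coherences, distances) pair.
    """
    return max(candidates, key=lambda name: segm_eval(*candidates[name]))


if __name__ == "__main__":
    candidates = {
        "<P>": ([0.5, 0.7], [2.0]),
        "<HR>": ([0.4, 0.4, 0.4], [1.0, 1.0]),
    }
    print(select_segmentation(candidates))
```

The product rewards segmentations whose blocks are internally coherent and whose neighbouring blocks differ from each other.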

Construction of a list of segment delimiters (visual and format tags)
We studied the distribution of some HTML tags in a set of pages:
1. Selecting HTML tags which can be seen as segment delimiters.
2. Choosing a set of random pages from the WT10g collection.
3. Computing a frequency value for each HTML tag using the following measure:

\[
\mathrm{Freq}(g, S) = \frac{\sum_{p \in S} nb(g, p)}{nbp(g, S)}
\]

S: the set of 500 random pages.
g: an HTML tag.
Freq(g,S): the frequency of tag g in S.
nb(g,p): the number of occurrences of tag g in the page p.
nbp(g,S): the number of pages in S that contain the tag g.
4. Selecting the most frequent HTML tags as segment delimiters.

Tab1. Selected segment delimiters
Segment delimiters    Number of pages in WT10g
<P>                   264,039
<H1>..<H6>            221,049
<BR>                  112,796
<HR>                  68,002
No segments           96,743

Experimental Setup and Results
Experimental evaluation of the Okapi retrieval system at both page level (DocRank) and block level (BlockRank).
At block level, the score of the most relevant block in a page is used as the relevance score of that page.

Fig3. Comparison between DocRank and BlockRank on precision over the 11 standard recall levels (plot: precision from 0 to 0.6 against recall levels from 0% to 100%, one curve each for DocRank and BlockRank)

Tab2. MAP, P@5 and P@10 comparison
Measures    DocRank    BlockRank
MAP         0.133      0.2112 (+58%)
P@5         0.180      0.316 (+75%)
P@10        0.172      0.270 (+57%)

Fig4. Per-topic gain of BlockRank over the DocRank retrieval system in mean average precision (a), P@5 (b) and P@10 (c) (plots: gain from -50 to 100 for each topic)

Conclusion
The web page topic segmentation algorithm that we propose improves information retrieval by indexing documents more precisely and by subdividing texts into thematically coherent segments.

Future Works
Exploring techniques of link analysis at block level for web retrieval (thematic link analysis at block level).
Based on a thematic block graph, we plan to use a new ranking function that combines content and link rank, based on the propagation of scores over the links of a block-to-block graph according to query terms.
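
The tag-frequency measure Freq(g, S) from the delimiter-construction steps can be sketched in Python as follows. The sketch assumes the sampled pages are available as HTML strings and counts start tags with the standard-library `html.parser.HTMLParser`; the poster does not describe the tooling actually used, and the function and class names here are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class _StartTagCounter(HTMLParser):
    """Counts start tags in one page (tag names are lowercased by the parser)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def tag_frequencies(pages):
    """Freq(g, S): total occurrences of g divided by the pages containing g."""
    occurrences = Counter()     # sum over p in S of nb(g, p)
    pages_with_tag = Counter()  # nbp(g, S)
    for html in pages:
        parser = _StartTagCounter()
        parser.feed(html)
        parser.close()
        occurrences.update(parser.counts)
        pages_with_tag.update(parser.counts.keys())
    return {g: occurrences[g] / pages_with_tag[g] for g in occurrences}


if __name__ == "__main__":
    sample = ["<p>a</p><p>b</p><br>", "<P>c</P>", "<hr>"]
    freq = tag_frequencies(sample)
    # Most frequent tags first, as in step 4.
    print(sorted(freq, key=freq.get, reverse=True))
```

Restricting the result to the candidate delimiter tags chosen in step 1 would be a simple filter on the returned dictionary.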

The 30th Annual International ACM SIGIR Conference 23-27 July 2007, Amsterdam
