Our solution: web page segmentation based on evaluating several segmentations with a topic analysis method.
Model
Block coherence measure: a co-occurrence measure applied inside the block to estimate the block's coherence. The coherence of a block's content reflects the density of the information linked to a topic and the degree of correlation between the terms of the block.
High co-occurrence between block terms => high coherence inside the block.
The co-occurrence of two terms is defined as:
\[
\mathrm{Cooccurrence}(t_i, t_j) = \frac{\mathrm{Nbdoc}(t_i, t_j)}{\mathrm{Nbdoc}(t_i) + \mathrm{Nbdoc}(t_j) - \mathrm{Nbdoc}(t_i, t_j)}
\]
Nbdoc(t1,..,tn) : the number of documents containing all the terms t1,..,tn.
nt(b) : the number of terms of block b.
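As a rough illustration of the co-occurrence ratio (documents containing both terms, divided by documents containing either), the sketch below computes it over a small collection. Representing each document as a set of terms, and the names `nbdoc` and `cooccurrence`, are assumptions made for this example only; the poster does not describe its implementation.

```python
from typing import Iterable, List, Set


def nbdoc(docs: List[Set[str]], terms: Iterable[str]) -> int:
    """Number of documents containing all the given terms."""
    wanted = set(terms)
    return sum(1 for d in docs if wanted <= d)


def cooccurrence(docs: List[Set[str]], ti: str, tj: str) -> float:
    """Nbdoc(ti, tj) / (Nbdoc(ti) + Nbdoc(tj) - Nbdoc(ti, tj))."""
    both = nbdoc(docs, [ti, tj])
    either = nbdoc(docs, [ti]) + nbdoc(docs, [tj]) - both
    # Assumption: return 0.0 when neither term occurs in the collection.
    return both / either if either else 0.0


# Hypothetical toy collection, one set of terms per document.
docs = [
    {"web", "page"},
    {"web", "block"},
    {"page", "block"},
    {"web", "page", "block"},
]
print(cooccurrence(docs, "web", "page"))
```

In this toy collection "web" and "page" each occur in three documents and together in two, so the ratio is 2 / (3 + 3 - 2) = 0.5.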
Distance measure between adjacent blocks: based on the cosine measure, which computes a similarity value between two block vectors. The inverse of the cosine measure can be seen as a distance between two blocks.
High similarity between two adjacent blocks => small distance between them.
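\[
\mathrm{Dist}(b_i, b_j) = \frac{1}{\mathrm{Sim}(V_{b_i}, V_{b_j})} = \frac{1}{\cos(V_{b_i}, V_{b_j})} = \frac{\sqrt{\sum_{k=1}^{n} w_{k,b_i}^{2}} \times \sqrt{\sum_{k=1}^{n} w_{k,b_j}^{2}}}{\sum_{k=1}^{n} w_{k,b_i} \times w_{k,b_j}}
\]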
bi, bj are blocks and Vbi, Vbj are block vectors of bi and bj respectively.
The weight of each term in a block vector is calculated using the Okapi BM25 measure.
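A minimal sketch of the distance between two adjacent blocks, taken as the inverse of the cosine similarity of their block vectors. Storing each block vector as a dictionary from term to weight is an assumption of this example, and the weights are taken as given: the sketch does not compute the Okapi weighting. It also assumes non-negative weights and returns infinity when the blocks share no term.

```python
import math
from typing import Dict


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def distance(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Inverse of the cosine similarity; infinite when similarity is zero."""
    sim = cosine(u, v)
    return math.inf if sim == 0.0 else 1.0 / sim


# Hypothetical block vectors with precomputed term weights.
block_a = {"web": 1.0}
block_b = {"web": 1.0, "page": 1.0}
print(distance(block_a, block_b))
```

Identical vectors have a distance of 1, the smallest value this measure can take, and the distance grows without bound as the two blocks share fewer weighted terms.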
[Figure (a): P@5 per topic for DocRank and BlockRank; axes: Topics, precision.]
\[
\mathrm{Freq}(g, S) = \frac{\sum_{p \in S} nb(g, p)}{nbp(g, S)}
\]
nb(g,p): the number of occurrences of tag g in page p.
nbp(g,S): the number of pages in S that contain tag g.
4. Selecting the most frequent HTML tags as segment delimiters.

Segment delimiters    Number of pages in WT10g
<P>                   264,039
<H1>..<H6>            221,049
<BR>                  112,796
<HR>                  68,002
No segments           96,743
Tab1. Selected segment delimiters

... DocRank retrieval system.
Fig3. Comparison between DocRank and BlockRank function on precision over the 11 standard recall levels

Measure    DocRank    BlockRank
MAP        0.133      0.2112 (+58%)
P@5        0.180      0.316 (+75%)
P@10       0.172      0.270 (+57%)
Tab2. MAP, P@5 and P@10 comparison

Conclusion
The web page topic segmentation algorithm that we proposed improves information retrieval by indexing documents more precisely and by subdividing texts into thematically coherent segments.

Future Works
Exploring link analysis techniques at block level for web retrieval (thematic link analysis at block level).
Based on a thematic block graph, we try to use a new ranking function that combines content and link rank, based on propagation of scores over the links of the block-to-block graph according to query terms.
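Returning to step 4, the tag frequency Freq(g, S) (total occurrences of tag g over the pages of S, divided by the number of pages that contain g) and the choice of the most frequent tags can be sketched as follows. Representing each page as a list of tag names, and the helpers `tag_frequency` and `top_tags`, are assumptions for illustration; the poster does not say how pages were parsed or how many tags were kept.

```python
from collections import Counter
from typing import Dict, List


def tag_frequency(pages: List[List[str]]) -> Dict[str, float]:
    """Freq(g, S) = sum of nb(g, p) over pages p in S, divided by nbp(g, S)."""
    total = Counter()       # sum over pages of nb(g, p)
    page_count = Counter()  # nbp(g, S): pages containing g at least once
    for tags in pages:
        counts = Counter(tags)
        total.update(counts)
        page_count.update(counts.keys())
    return {g: total[g] / page_count[g] for g in total}


def top_tags(freq: Dict[str, float], k: int) -> List[str]:
    """The k tags with the highest frequency, as candidate delimiters."""
    return sorted(freq, key=freq.get, reverse=True)[:k]


# Hypothetical sample: the sequence of tag names found in each page.
pages = [
    ["p", "p", "br"],
    ["p", "hr"],
    ["br", "br", "br"],
]
freq = tag_frequency(pages)
print(freq)
print(top_tags(freq, 2))
```

In this sample, "br" occurs four times across two pages (frequency 2.0) and "p" three times across two pages (1.5), so these two tags are the ones retained when k = 2.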
The 30th Annual International ACM SIGIR Conference 23-27 July 2007, Amsterdam