O Large collections of documents are becoming increasingly
common, and it is important to organize them into structured
ontologies
O Manual construction of structured ontologies is one possible
solution
O A need for automated solutions arises
O Disciplines such as text mining (TM) and information retrieval
(IR) face many challenges
O TM is content-based: it operates on unstructured text
documents and extracts useful information from them
O TM comprises many different methods (document clustering,
categorization, document indexing)
O IR deals with the representation, storage and organization of
information, and with methods for accessing it
O The principal objective of an IR system is to retrieve, for a
given query, all of the relevant documents and as few
irrelevant documents as possible
O The three classical models in IR are:
probabilistic, logical and vector space
O Documents must be preprocessed before any
information retrieval model is built
O In the vector space model, documents are represented as
vectors of term frequencies, and the document set is
represented as a matrix of document vectors
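The matrix representation described above can be sketched in a few lines of Python. This is only an illustration: the helper name `term_document_matrix` and the sample documents are hypothetical, and preprocessing is assumed to be nothing more than lowercasing and whitespace splitting, which is far simpler than what a real IR system would do.

```python
from collections import Counter

def term_document_matrix(docs):
    """Return (terms, matrix); matrix[i][j] is the frequency of terms[i] in docs[j]."""
    # Assumption: preprocessing is reduced to lowercasing and whitespace splitting.
    counts = [Counter(doc.lower().split()) for doc in docs]
    terms = sorted(set().union(*counts))
    # One row per term, one column per document.
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix

# Hypothetical toy collection of three documents.
docs = ["fluid flow in pipes", "blood flow", "blood pressure and blood flow"]
terms, A = term_document_matrix(docs)
```

Each column of `A` is then the term-frequency vector of one document, and each row describes how one term is distributed over the collection.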
     
  
 
    
 
  
O imilarity measure between document  and query S is
defined as cosine of the angle between their vector
representations
O tandard IR evaluation measures are [ [, [  
O Measures that combine  and r are   
[ [  
 , defined for 11 standard levels of recall, and 
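As a small illustration of the cosine similarity between a document vector and a query vector, the following sketch may help. The function name `cosine_similarity` and the example vectors are hypothetical, and returning 0.0 for an all-zero vector is a convention assumed here, not something the slides specify.

```python
import math

def cosine_similarity(d, q):
    """Cosine of the angle between two term-frequency vectors of equal length."""
    dot = sum(x * y for x, y in zip(d, q))
    norm_d = math.sqrt(sum(x * x for x in d))
    norm_q = math.sqrt(sum(y * y for y in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # assumed convention: an empty vector matches nothing
    return dot / (norm_d * norm_q)

# Hypothetical vectors over a four-term vocabulary.
doc = [0, 2, 1, 0]
query = [0, 1, 1, 0]
sim = cosine_similarity(doc, query)
```

Because only the angle matters, a long document and a short one with the same term proportions receive the same score for a given query.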
O One of the main problems in IR with the vector space model of
documents is the high dimensionality of the term-document
matrix
O The number of documents in a collection may vary
from a few thousand to several hundred thousand, and the
number of terms is often more than a few thousand
O Dimensionality reduction is therefore very useful
O There are many motives for dimensionality reduction: memory
space reduction, better IR and classification performance,
noise and redundancy elimination
O Dimensionality reduction can be performed in two
ways: by feature selection or by feature construction
O Different dimensionality reduction methods are used for IR,
but the most popular are latent semantic indexing (LSI) and
concept indexing (CI)
O LSI uses the truncated singular value decomposition (SVD) of the
term-document matrix: $A \approx A_k = U_k \Sigma_k V_k^T$
O Documents are ranked by their relevance to a
query $q$ by calculating the score vector
$s = q^T A_k = q^T U_k \Sigma_k V_k^T$
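The LSI ranking step can be sketched in pure Python as follows. This is an assumption-laden illustration, not the MATLAB implementation mentioned in the slides: the truncated SVD is approximated by power iteration with deflation (a simple technique chosen here to avoid external libraries), and the names `truncated_svd` and `lsi_scores` as well as the toy matrix are hypothetical.

```python
import math
import random

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def truncated_svd(A, k, iters=200, seed=0):
    """Return up to k triplets (sigma, u, v) of A, largest sigma first.

    Power iteration with deflation; illustrative only, not numerically robust.
    """
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    residual = [row[:] for row in A]
    triplets = []
    for _ in range(k):
        residual_t = [list(col) for col in zip(*residual)]
        v = [rng.random() + 0.1 for _ in range(n)]
        for _ in range(iters):
            w = matvec(residual_t, matvec(residual, v))  # (R^T R) v
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        rv = matvec(residual, v)
        sigma = math.sqrt(sum(x * x for x in rv))
        if sigma < 1e-12:
            break  # the remaining residual is numerically zero
        u = [x / sigma for x in rv]
        triplets.append((sigma, u, v))
        # Deflate: remove the rank-one component just found.
        residual = [[residual[i][j] - sigma * u[i] * v[j] for j in range(n)]
                    for i in range(m)]
    return triplets

def lsi_scores(triplets, q):
    """Score vector q^T U_k Sigma_k V_k^T, one score per document."""
    n = len(triplets[0][2])
    scores = [0.0] * n
    for sigma, u, v in triplets:
        weight = sigma * sum(qi * ui for qi, ui in zip(q, u))
        for j in range(n):
            scores[j] += weight * v[j]
    return scores

# Hypothetical 3-term x 3-document matrix; the query contains only the first term.
A = [[1, 1, 0],
     [1, 1, 0],
     [0, 0, 1]]
scores = lsi_scores(truncated_svd(A, k=1), q=[1, 0, 0])
```

With `k=1` only the dominant direction, shared by the first two documents, is kept, so the third document receives a score close to zero for this query. A production system would rely on a tested SVD routine instead of this loop.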
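O CI uses the concept decomposition (CD) of the term-document
matrix, which is performed in two steps:
‡ clustering the documents into $k$ clusters, whose centroids (concept vectors) form the concept matrix $C_k = [c_1, c_2, \dots, c_k]$
‡ projecting the documents onto the cluster centroids by least squares
approximation: $A \approx C_k Z^*$, where $Z^* = (C_k^T C_k)^{-1} C_k^T A$
O As in LSI, a score vector is calculated for a given query $q$ and
the documents are ranked by their relevance: $s = q^T C_k Z^*$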
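The two steps of the concept decomposition can be sketched as below. This is a minimal pure-Python illustration and not the C++/MKL implementation described in the slides. It assumes spherical k-means for the clustering step, a deterministic start from the first k documents, and a non-singular Gram matrix of the concept vectors; all function names and the toy data are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)

def spherical_kmeans(docs, k, iters=20):
    """Cluster unit-length document vectors; return k unit-length concept vectors."""
    # Assumption: the first k documents serve as the initial concept vectors.
    concepts = [list(d) for d in docs[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for d in docs:
            best = max(range(k), key=lambda c: dot(d, concepts[c]))
            clusters[best].append(d)
        updated = [normalize([sum(col) for col in zip(*members)]) if members
                   else concepts[c] for c, members in enumerate(clusters)]
        if updated == concepts:
            break
        concepts = updated
    return concepts

def solve(M, b):
    """Solve M x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [bi] for row, bi in zip(M, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def project(docs, concepts):
    """Least squares coordinates of each document: solve (C^T C) z = C^T d."""
    gram = [[dot(ci, cj) for cj in concepts] for ci in concepts]
    return [solve(gram, [dot(c, d) for c in concepts]) for d in docs]

def ci_scores(q, concepts, Z):
    """Score vector q^T C Z, one score per document."""
    qc = [dot(q, c) for c in concepts]
    return [dot(qc, z) for z in Z]

# Hypothetical collection: four documents over four terms, two obvious topics.
docs = [normalize(d) for d in [[1, 1, 0, 0], [0, 0, 1, 1], [2, 1, 0, 0], [0, 0, 1, 2]]]
concepts = spherical_kmeans(docs, k=2)
Z = project(docs, concepts)
scores = ci_scores([1, 0, 0, 0], concepts, Z)
```

In this toy collection the query term occurs only in the first and third documents, so only those two receive a positive score. Solving the normal equations directly is acceptable for a handful of concept vectors; a larger system would more likely use a QR-based least squares routine.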
O We have implemented LSI and CI
O LSI is implemented in the MATLAB programming language
O CI is developed as three components:
‡ Intel's MKL (BLAS and LAPACK libraries)
‡ Back-end written in C++: the core algorithm implementation (unmanaged C++) and a wrapper serving as the interface for the GUI (managed C++)
‡ Front-end (GUI) written in C#
O Experiments are performed on two standard document
collections, MEDLINE and CRANFIELD
O MEDLINE contains 1033 article abstracts related to
medicine and 7014 index terms (after preprocessing), and has
30 defined queries
O CRANFIELD contains 1400 documents related to aeronautics
and 3763 index terms (after preprocessing), and has 225
defined queries
O Decomposition error
[Figure: two panels, MEDLINE and CRANFIELD]
O Orthonormality of concept vectors
[Figure: two panels, MEDLINE and CRANFIELD]
O IR evaluation (without folding-in)
[Figures on two slides, each with two panels, MEDLINE and CRANFIELD]
O IR evaluation (with folding-in), only LSI in 50 clusters
[Figure: two panels, both MEDLINE]
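Folding-in adds a new document to an existing LSI space without recomputing the decomposition. The slides do not state the formula used, so the sketch below assumes the commonly cited one, $\hat{d} = \Sigma_k^{-1} U_k^T d$. The function name `fold_in` and the factor values are hypothetical.

```python
def fold_in(d, factors):
    """Coordinates of a new document d in the k-dimensional space.

    factors is a list of (sigma, u) pairs, where each u is a unit-length
    left singular vector; the result is Sigma_k^{-1} U_k^T d.
    """
    return [sum(di * ui for di, ui in zip(d, u)) / sigma for sigma, u in factors]

# Hypothetical rank-2 factors over a three-term vocabulary.
factors = [(2.0, [0.6, 0.8, 0.0]), (1.0, [0.0, 0.0, 1.0])]
new_doc = [1.0, 2.0, 3.0]
coords = fold_in(new_doc, factors)
```

Since the existing factors stay fixed, any part of the new document that lies outside their span is lost. This is one plausible reason for results to depend on how much of the collection was present when the factors were computed.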
O Concept vectors tend towards orthonormality
O Concept vectors are local (they have well-defined semantics), whereas singular vectors
are global (they are not interpretable)
O Regarding approximation error, LSI gives slightly better results than CI
O LSI gives better results than CI regarding IR evaluation
O In CI, the fuzzy clustering algorithm is slightly better regarding IR evaluation than
the spherical clustering algorithm, but has lower performance
O Regarding folding-in of documents into the document collection without
recomputation or correction of the transformation matrix, the best IR
evaluation results are obtained when a large percentage of the documents is in the
initial collection (80% in this example)
O In general, better IR evaluation results are obtained for MEDLINE than for
CRANFIELD
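O Evaluation measures: recall (r), precision (p), mean average precision
(MAP), F1
‡ $d_i$, $1 \le i \le n$: document
‡ A: set of retrieved documents
‡ R: set of relevant documents for query q
‡ $R_a$: intersection of A and R ($R_a = A \cap R$)
O Each document is either retrieved or not (YES/NO) and either relevant or not
(YES/NO); $R_a$ holds the documents that are both retrieved and relevant
O Recall, precision and F1:
$$r = \frac{|R_a|}{|R|}, \qquad p = \frac{|R_a|}{|A|}, \qquad F_1 = \frac{2 p r}{p + r}$$
O MAP is the mean, over all queries, of the precision averaged over the 11 standard
recall levels $r_j = j/10$, $j = 0, 1, \dots, 10$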
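The measures above can be computed as in the following sketch. The function names and the sample ranking are hypothetical. The interpolation rule (the highest precision at any recall greater than or equal to the level) is the usual textbook convention and is an assumption here, since the slides' own formula did not survive.

```python
def precision_recall_f1(retrieved, relevant):
    """Precision, recall and F1 for one query, from sets of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # |R_a|
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1

def eleven_point_average_precision(ranking, relevant):
    """Average of interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    relevant = set(relevant)
    points = []  # (recall, precision) observed at each relevant document
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    levels = [j / 10 for j in range(11)]
    # Assumed rule: interpolated precision is the best precision at recall >= level.
    interpolated = [max((p for r, p in points if r >= level - 1e-12), default=0.0)
                    for level in levels]
    return sum(interpolated) / len(levels)

# Hypothetical ranked result list and relevance judgments for one query.
ranking = ["d3", "d1", "d7", "d2", "d5"]
relevant = {"d1", "d2", "d9"}
p, r, f1 = precision_recall_f1(ranking, relevant)
avg_p = eleven_point_average_precision(ranking, relevant)
```

Averaging `eleven_point_average_precision` over all queries of a collection gives a MAP value in the sense used above.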
O CI performance (number of algorithm iterations)
[Figure: two panels, MEDLINE and CRANFIELD]
O CI performance (test duration)
[Figure: two panels, MEDLINE and CRANFIELD]
O CI performance (memory consumption)
[Figure: two panels, MEDLINE and CRANFIELD]
