O Large collections of documents are becoming increasingly
common and it is important to organize them into structured
ontologies
O Manual construction of structured ontologies is one possible
solution
O A need for automated solutions therefore arises
O Disciplines like text mining (TM) and information retrieval
(IR) have many challenges
O TM is content-based and operates on unstructured text
documents, extracting useful information
O TM comprises many different methods (document clustering,
categorization, document indexing)
O The IR discipline deals with presentation, storage and methods for
information organization and access
O The principal objective of an IR system is, for a given query,
to retrieve all of the relevant documents and as few
irrelevant documents as possible
O The three classical IR models are the
probabilistic, logic and vector space models
O Document preprocessing is necessary before creating any
information retrieval model
O In the vector space model, documents are represented as
vectors of term frequencies, and the document set is
represented as a matrix of document vectors
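As an illustration of the vector space representation above, here is a minimal sketch that builds a term-document matrix of raw term frequencies. The function name `term_document_matrix`, the whitespace tokenizer, and the toy documents are assumptions made for this example; real preprocessing (stop-word removal, stemming, weighting) would replace the naive `split()`.

```python
from collections import Counter

import numpy as np


def term_document_matrix(docs):
    """Return (sorted vocabulary, term-by-document frequency matrix)."""
    # Naive preprocessing (assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    row = {term: i for i, term in enumerate(vocab)}
    # Rows are terms, columns are documents.
    A = np.zeros((len(vocab), len(docs)))
    for j, tokens in enumerate(tokenized):
        for term, freq in Counter(tokens).items():
            A[row[term], j] = freq
    return vocab, A


docs = [
    "text mining extracts information",
    "information retrieval ranks text",
    "retrieval retrieval",
]
vocab, A = term_document_matrix(docs)
```

Each column of `A` is one document vector; a query can be turned into a vector over the same vocabulary in the same way.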
O The similarity measure between a document and a query is
defined as the cosine of the angle between their vector
representations
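The cosine measure can be sketched in a few lines. The function name `cosine_similarity` and the choice to return 0 for an all-zero vector are assumptions made for this example.

```python
import numpy as np


def cosine_similarity(d, q):
    """Cosine of the angle between document vector d and query vector q."""
    d = np.asarray(d, dtype=float)
    q = np.asarray(q, dtype=float)
    denom = np.linalg.norm(d) * np.linalg.norm(q)
    # Assumption: an all-zero vector is treated as similarity 0.
    if denom == 0:
        return 0.0
    return float(d @ q / denom)
```

Because term frequencies are non-negative, the value lies between 0 (no shared terms) and 1 (same direction), so documents can be ranked by it directly.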
O Standard IR evaluation measures are recall (r) and precision (p)
O Measures that combine p and r are mean average precision (MAP),
defined for 11 standard levels of recall, and F1
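A minimal sketch of the set-based measures, assuming the retrieved and relevant documents are given as collections of document ids. The function name `precision_recall_f1` and the convention of returning 0 for empty sets are assumptions of this example.

```python
def precision_recall_f1(retrieved, relevant):
    """Return (precision, recall, F1) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1


p, r, f1 = precision_recall_f1(retrieved=[1, 2, 3, 4], relevant=[2, 4, 6])
```

Here two of the four retrieved documents are relevant and two of the three relevant documents are found, so precision is 1/2 and recall is 2/3.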
O One of the main problems in IR with the vector space model of
documents is the high dimensionality of the term-document
matrix
O The number of documents in a collection may vary
from a few thousand to several hundred thousand, and the
number of terms is often more than a few thousand
O Reduction of dimensionality appears to be very useful
O There are many motives for dimensionality reduction: memory
space reduction, better IR and classification performance,
noise and redundancy elimination
O Dimensionality reduction can be performed in one of two
ways: feature selection or feature construction
O Different dimensionality reduction methods are used for IR,
but the most popular are latent semantic indexing (LSI) and
concept indexing (CI)
O LSI uses the truncated singular value decomposition (SVD) of the
term-document matrix
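A minimal sketch of the truncated SVD step, assuming NumPy's dense `np.linalg.svd` is adequate for the matrix size. The helper name `truncated_svd`, the random stand-in matrix and the choice `k = 2` are illustrative assumptions; real collections are large and sparse and would normally call for a sparse solver.

```python
import numpy as np


def truncated_svd(A, k):
    """Keep the k largest singular triplets of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]


rng = np.random.default_rng(0)
A = rng.random((8, 5))  # stand-in for a term-document matrix (assumption)

U_k, s_k, Vt_k = truncated_svd(A, k=2)
A_k = U_k @ np.diag(s_k) @ Vt_k  # rank-k approximation of A

# Relative approximation error in the Frobenius norm.
rel_error = np.linalg.norm(A - A_k) / np.linalg.norm(A)
```

The columns of `np.diag(s_k) @ Vt_k` give the documents in the reduced k-dimensional space, and `rel_error` is one way to quantify how much of the original matrix is lost.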
[Figure: MEDLINE and CRANFIELD panels]
O Orthonormality of concept vectors
[Figure: MEDLINE and CRANFIELD panels]
O IR evaluation (without folding-in)
[Figure: MEDLINE and CRANFIELD panels]
O IR evaluation (without folding-in)
[Figure: MEDLINE and CRANFIELD panels]
O IR evaluation (with folding-in), only LSI in 50 clusters
[Figure: two MEDLINE panels]
O Concept vectors tend towards orthonormality
O Concept vectors are local (have well-defined semantics), while singular vectors
are global (are not interpretable)
O Regarding approximation error, LSI gives slightly better results than CI
O LSI gives better results than CI regarding IR evaluation
O In CI, the fuzzy clustering algorithm is slightly better than the spherical
one regarding IR evaluation, but has lower performance
O Regarding folding-in documents into the document collection without
recomputation or correction of the transformation matrix, the best IR
evaluation results are obtained when a large percentage of the documents
is in the initial collection (80% in this example)
O Generally, better IR evaluation results are obtained for MEDLINE than for
CRANFIELD
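Folding-in can be illustrated for LSI as follows. The slides do not state the projection formula, so this sketch assumes the commonly used one, in which a new document vector d is mapped to diag(1/s_k) · U_kᵀ · d with the transformation fixed from the initial documents. The helper names `fit_lsi` and `fold_in`, the random matrix and the 80/20 split are assumptions made for this example.

```python
import numpy as np


def fit_lsi(A, k):
    """Truncated SVD of the initial term-document matrix."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]


def fold_in(U_k, s_k, new_docs):
    """Project new document columns without recomputing the SVD."""
    return (U_k.T @ new_docs) / s_k[:, None]


rng = np.random.default_rng(1)
A = rng.random((10, 10))  # stand-in term-document matrix (assumption)
n_initial = 8             # 80% of the documents form the initial collection

U_k, s_k, Vt_k = fit_lsi(A[:, :n_initial], k=3)
folded = fold_in(U_k, s_k, A[:, n_initial:])  # remaining 20% folded in
```

Folding in a document that was already part of the initial collection reproduces its column of `Vt_k`, which is a quick sanity check on the projection. New documents only get an approximate representation, which is consistent with results improving as the initial share grows.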
O recall (r), precision (p), mean average precision
(MAP), F1
A: set of retrieved documents
R: set of relevant documents for query q
R_a: intersection of A and R (R_a = A ∩ R)

                          Relevant documents
                          YES       NO
Retrieved documents  YES  R_a       A \ R
                     NO   R \ A     (all other documents)
r = \frac{|R_a|}{|R|}, \qquad p = \frac{|R_a|}{|A|}, \qquad F_1 = \frac{2pr}{p + r}

\bar{p}(r_j) = \max_{r_j \le r \le r_{j+1}} p(r), \qquad j = 0, \ldots, 9
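A minimal sketch of average precision over the 11 standard recall levels (0.0, 0.1, ..., 1.0) for a single ranked result list. It uses the common convention of taking, at each level, the maximum precision observed at any recall at or above that level; this convention is an assumption and may differ in detail from the interpolation rule used in the source. The function name is hypothetical. MAP would be the mean of this value over all queries.

```python
import numpy as np


def eleven_point_average_precision(ranking, relevant):
    """Interpolated average precision over the 11 standard recall levels."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    recalls, precisions = [], []
    # Recall and precision after each position of the ranked list.
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        recalls.append(hits / len(relevant))
        precisions.append(hits / rank)
    interpolated = []
    for level in np.linspace(0.0, 1.0, 11):
        # Assumed rule: best precision at any recall >= this level.
        candidates = [p for r, p in zip(recalls, precisions) if r >= level - 1e-12]
        interpolated.append(max(candidates, default=0.0))
    return float(np.mean(interpolated))


score = eleven_point_average_precision(ranking=[1, 2, 3, 4], relevant=[1, 3])
```

In this toy ranking the relevant documents sit at positions 1 and 3, so the six levels up to recall 0.5 get precision 1 and the five higher levels get 2/3.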
O CI performances (number of algorithm iterations)
[Figure: MEDLINE and CRANFIELD panels]
O CI performances (test duration)
[Figure: MEDLINE and CRANFIELD panels]
O CI performances (memory consumption)
[Figure: MEDLINE and CRANFIELD panels]