Graduate Thesis Presentation

O Large collections of documents are becoming increasingly
common, and it is important to organize them into structured
ontologies
O Manual construction of structured ontologies is one possible
solution
O A need for automated solutions arises
O Disciplines like text mining (TM) and information retrieval
(IR) face many challenges
O TM is content-based and operates on unstructured text
documents, extracting useful information
O TM comprises many different methods (document clustering,
categorization, document indexing)
O IR deals with the representation, storage and organization of
information, and with methods for accessing it
O The principal objective of an IR system is to retrieve, for a
given query, all of the relevant documents and as few
irrelevant documents as possible
O The three classical IR models are the probabilistic, logical
and vector space models
O Document preprocessing is necessary before creating any
information retrieval model
O In the vector space model, documents are represented as
vectors of term frequencies and the document set is
represented as a matrix of document vectors
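The vector space representation above can be sketched in a few lines. This is a minimal illustration rather than the preprocessing used in the thesis: the function name `build_term_document_matrix`, the toy documents, and the plain lowercase whitespace tokenization are all assumptions made for the example.

```python
from collections import Counter

def build_term_document_matrix(docs):
    """Return (terms, matrix), where matrix[i][j] is the frequency of
    terms[i] in docs[j]. A lowercase whitespace split stands in for real
    preprocessing (an assumption for this sketch)."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    terms = sorted(set().union(*counts))
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix

# Toy documents, illustrative only.
docs = [
    "wing flow pressure",
    "blood pressure patient",
    "wing wing lift",
]
terms, A = build_term_document_matrix(docs)
```

Each column of `A` is one document vector and each row corresponds to one index term, so the matrix has as many rows as there are distinct terms.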

O The similarity measure between a document and a query is
defined as the cosine of the angle between their vector
representations
O Standard IR evaluation measures are recall (r) and precision (p)
O Measures that combine p and r are mean average precision
(MAP), defined for 11 standard levels of recall, and F1
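The cosine similarity named above can be sketched as follows. The helper names `cosine_similarity` and `rank_documents` and the toy vectors are assumptions for illustration; returning 0.0 for an all-zero vector is a convention chosen here, not something the slides specify.

```python
import math

def cosine_similarity(d, q):
    """Cosine of the angle between document vector d and query vector q."""
    dot = sum(x * y for x, y in zip(d, q))
    norm_d = math.sqrt(sum(x * x for x in d))
    norm_q = math.sqrt(sum(y * y for y in q))
    if norm_d == 0.0 or norm_q == 0.0:
        return 0.0  # convention for empty vectors (an assumption)
    return dot / (norm_d * norm_q)

def rank_documents(doc_vectors, q):
    """Return document indices sorted by decreasing similarity to q."""
    scores = [cosine_similarity(d, q) for d in doc_vectors]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)

# Toy term-frequency vectors over three terms, illustrative only.
doc_vectors = [[1, 1, 0], [0, 1, 1], [0, 0, 2]]
query = [1, 1, 0]
ranking = rank_documents(doc_vectors, query)
```

Because the cosine depends only on direction, a long document and a short one with the same term proportions receive the same score.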
O One of the main problems in IR with the vector space model of
documents is the high dimensionality of the term-document
matrix
O The number of documents in a collection may vary
from a few thousand to several hundred thousand, and the
number of terms is often more than a few thousand
O Dimensionality reduction is therefore very useful
O There are many motives for dimensionality reduction: memory
space reduction, better IR and classification performance,
noise and redundancy elimination
O Dimensionality reduction can be performed in two ways:
feature selection or feature construction
O Different dimensionality reduction methods are used in IR,
but the most popular are latent semantic indexing (LSI) and
concept indexing (CI)
O LSI uses the truncated singular value decomposition (SVD) of the
term-document matrix, $A \approx A_k = U_k \Sigma_k V_k^T$
O The ranking of documents according to their relevance for a
query $q$ is executed by calculating the score vector
$s^T = q^T A_k = q^T U_k \Sigma_k V_k^T$
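A minimal NumPy sketch of LSI scoring, assuming the standard formulation in which the term-document matrix is replaced by its rank-k truncated SVD and the query vector is multiplied against that approximation. The function names `lsi_fit` and `lsi_scores`, the toy matrix, and the choice k = 2 are assumptions for illustration; the thesis implementation itself is in MATLAB.

```python
import numpy as np

def lsi_fit(A, k):
    """Rank-k truncated SVD of a term-document matrix A (terms x documents)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def lsi_scores(q, U_k, s_k, Vt_k):
    """Score vector q^T U_k diag(s_k) V_k^T, one entry per document."""
    return ((q @ U_k) * s_k) @ Vt_k

# Toy term-document matrix (4 terms x 3 documents), illustrative only.
A = np.array([[2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 2.0, 0.0],
              [0.0, 1.0, 1.0]])
q = np.array([1.0, 1.0, 0.0, 0.0])   # query containing the first two terms

U_k, s_k, Vt_k = lsi_fit(A, k=2)
scores = lsi_scores(q, U_k, s_k, Vt_k)
ranking = np.argsort(-scores)         # document indices, best match first
```

The scores are computed from the three factors directly, so the dense rank-k matrix never has to be formed, which matters when the collection is large.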
O CI uses the concept decomposition (CD) of the term-document
matrix, which is performed in two steps:
  O clustering the documents
  O projecting the documents onto the cluster centroids by least
  squares approximation
O Like in LSI, a score vector is calculated for a given query and
documents are ranked by their relevance
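The two steps of the concept decomposition can be sketched in NumPy as below. This is an illustration under stated assumptions, not the thesis implementation (which is written in C++ on top of MKL): it uses spherical k-means for the clustering step, normalizes documents to unit length before clustering, runs a fixed number of iterations, and all names (`spherical_kmeans`, `concept_decomposition`, `C`, `Z`) and the toy data are hypothetical.

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=20, seed=0):
    """Cluster the unit-length columns of X; return concept vectors (terms x k)."""
    rng = np.random.default_rng(seed)
    start = rng.choice(X.shape[1], size=k, replace=False)
    C = X[:, start].copy()                       # initial concept vectors
    for _ in range(n_iter):
        labels = np.argmax(C.T @ X, axis=0)      # most similar concept by cosine
        for j in range(k):
            members = X[:, labels == j]
            if members.shape[1] > 0:             # an empty cluster keeps its old vector
                c = members.sum(axis=1)
                C[:, j] = c / np.linalg.norm(c)  # normalized centroid
    return C

def concept_decomposition(A, k):
    """Return (C, Z) with X ~ C @ Z, where X is A with unit-length columns."""
    X = A / np.linalg.norm(A, axis=0, keepdims=True)   # assumes no empty document
    C = spherical_kmeans(X, k)
    Z, *_ = np.linalg.lstsq(C, X, rcond=None)          # least squares projection
    return C, Z

# Toy term-document matrix (4 terms x 4 documents), illustrative only.
A = np.array([[2.0, 1.0, 0.0, 0.0],
              [1.0, 2.0, 0.0, 1.0],
              [0.0, 0.0, 3.0, 1.0],
              [0.0, 1.0, 1.0, 2.0]])
q = np.array([1.0, 1.0, 0.0, 0.0])

C, Z = concept_decomposition(A, k=2)
scores = q @ C @ Z                   # score vector, one entry per document
ranking = np.argsort(-scores)
```

Unlike singular vectors, the columns of `C` are normalized sums of actual documents, so with non-negative term frequencies they stay non-negative and each one can be read as a weighted list of terms.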
O We have implemented LSI and CI
O LSI is implemented in the MATLAB programming language
O CI is developed within three components:
  O Intel's MKL (BLAS and LAPACK libraries)
  O Back-end written in C++
    O Core algorithm implementation (unmanaged C++)
    O Wrapper as an interface for the GUI (managed C++)
  O Front-end (GUI) written in C#
O Experiments are performed on two standard document
collections, MEDLINE and CRANFIELD
O MEDLINE contains 1033 article abstracts related to
medicine and 7014 index terms (after preprocessing), and has
30 defined queries
O CRANFIELD contains 1400 documents related to aeronautics
and 3763 index terms (after preprocessing), and has 225
defined queries
O Decomposition error
[Figure: MEDLINE and CRANFIELD panels]
O Orthonormality of concept vectors
[Figure: MEDLINE and CRANFIELD panels]
O IR evaluation (without folding-in)
[Figure: MEDLINE and CRANFIELD panels]
O IR evaluation (without folding-in)
[Figure: MEDLINE and CRANFIELD panels]
O IR evaluation (with folding-in), only LSI in 50 clusters
[Figure: two MEDLINE panels]
O Concept vectors tend towards orthonormality
O Concept vectors are local (have well-defined semantics) and singular vectors
are global (are not interpretable)
O Regarding approximation error, LSI gives slightly better results than CI
O LSI gives better results than CI regarding IR evaluation
O In CI, the fuzzy clustering algorithm gives slightly better IR evaluation
results than the spherical clustering algorithm, but has lower performance
O When folding documents into the document collection without
recomputation or correction of the transformation matrix, the best IR
evaluation results are obtained when a large percentage of the documents
is in the initial collection (80% in this example)
O Generally, better IR evaluation results are obtained for MEDLINE than for
CRANFIELD
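Folding-in can be sketched for the LSI case as follows. The sketch assumes the common formulation in which a new document d is given the coordinates diag(s_k)^-1 U_k^T d in the existing reduced space; the function name `fold_in`, the random toy matrix, and k = 3 are assumptions for illustration, and the CI variant is not shown.

```python
import numpy as np

def fold_in(U_k, s_k, Vt_k, D_new):
    """Append new documents (columns of D_new) to an existing LSI model.

    Each new document d gets coordinates diag(s_k)^-1 U_k^T d; U_k and s_k
    are left unchanged, so nothing is recomputed.
    """
    coords = (U_k.T @ D_new) / s_k[:, None]
    return np.hstack([Vt_k, coords])

rng = np.random.default_rng(0)
A = rng.integers(0, 3, size=(6, 10)).astype(float)   # toy term-document matrix
n_initial = 8                                        # 80% of the documents

# Fit the model on the initial documents only.
U, s, Vt = np.linalg.svd(A[:, :n_initial], full_matrices=False)
k = 3
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Fold the remaining 20% into the existing reduced space.
Vt_all = fold_in(U_k, s_k, Vt_k, A[:, n_initial:])
```

Since the basis is fixed by the initial documents, the folded-in documents are only as well represented as that basis allows, which is consistent with results improving as the initial share grows.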
O Recall (r), precision (p), mean average precision
(MAP), F1
A: set of retrieved documents
R: set of relevant documents for query q
$R_a$: intersection of A and R ($R_a = A \cap R$)
[Table: retrieved documents (YES/NO) against relevant documents (YES/NO)]
$r = \frac{|R_a|}{|R|}$, $p = \frac{|R_a|}{|A|}$, $F_1 = \frac{2pr}{p + r}$
$\bar{p}(r_j) = \max_{r_j \le r \le r_{j+1}} p(r)$
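The evaluation measures can be sketched in plain Python. The function names, the toy ranking and the relevance set are assumptions for illustration. For the 11 recall levels, the sketch takes the interpolated precision at a level to be the highest precision observed at any recall greater than or equal to that level; this is one common convention and may differ in detail from the rule used in the thesis.

```python
def precision_recall_f1(retrieved, relevant):
    """Precision, recall and F1 for a retrieved set and a relevant set."""
    hits = len(set(retrieved) & set(relevant))
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1

def eleven_point_precision(ranking, relevant):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    relevant = set(relevant)
    points = []                      # (recall, precision) at each relevant hit
    hits = 0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / i))
    levels = [j / 10 for j in range(11)]
    return [max((p for r, p in points if r >= level), default=0.0)
            for level in levels]

# Toy ranked result list and relevance judgements, illustrative only.
ranking = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d5"}

p, r, f1 = precision_recall_f1(ranking, relevant)
curve = eleven_point_precision(ranking, relevant)
avg_precision = sum(curve) / len(curve)   # averaged again over queries for MAP
```

Averaging `avg_precision` over all queries of a collection gives a single MAP-style figure per method, which is what makes methods comparable across query sets.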
O CI performance (number of algorithm iterations)
[Figure: MEDLINE and CRANFIELD panels]
O CI performance (test duration)
[Figure: MEDLINE and CRANFIELD panels]
O CI performance (memory consumption)
[Figure: MEDLINE and CRANFIELD panels]
