Latent semantic analysis (LSA) is a technique in natural language processing, in particular in vectorial semantics, for analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text. A matrix containing word counts per paragraph (rows represent unique words and columns represent each paragraph) is constructed from a large piece of text, and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of columns while preserving the similarity structure among rows. Words are then compared by taking the cosine of the angle between the two vectors formed by any two rows. Values close to 1 represent very similar words, while values close to 0 represent very dissimilar words.[1]

LSA was patented in 1988 (US Patent 4,839,853) by Scott Deerwester, Susan Dumais, George Furnas, Richard Harshman, Thomas Landauer, Karen Lochbaum and Lynn Streeter. In the context of its application to information retrieval, it is sometimes called Latent Semantic Indexing (LSI).[2]
Overview

Occurrence matrix
LSA can use a term-document matrix which describes the occurrences of terms in documents; it is a sparse matrix whose rows correspond to terms and whose columns correspond to documents. A typical example of the weighting of the elements of the matrix is tf-idf (term frequency-inverse document frequency): each element of the matrix is proportional to the number of times the corresponding term appears in the corresponding document, with rare terms upweighted to reflect their relative importance.
This matrix is also common to standard semantic models, though it is not necessarily explicitly expressed as a matrix, since the mathematical properties of matrices are not always used.
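A minimal sketch of how such a weighted occurrence matrix can be built in practice, here with scikit-learn's TfidfVectorizer; the toy corpus, the library choice, and the variable names are illustrative assumptions rather than anything prescribed by the article:

    # Build a tf-idf weighted term-document matrix (terms as rows, documents as columns).
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "The chair of the board met the board members.",
        "The chair maker built a wooden chair.",
        "Cars and trucks are common vehicles.",
    ]

    vectorizer = TfidfVectorizer()
    doc_term = vectorizer.fit_transform(docs)   # sparse matrix: documents x terms
    term_doc = doc_term.T                       # transpose to terms x documents, as used by LSA
    vocabulary = vectorizer.get_feature_names_out()

    print(term_doc.shape)  # (number of unique terms, number of documents)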
Rank lowering
After the construction of the occurrence matrix, LSA finds a low-rank approximation[3] to the term-document matrix. There can be various reasons for this approximation:

- The original term-document matrix is presumed too large for the computing resources; in this case, the approximated low-rank matrix is interpreted as an approximation (a "least and necessary evil").
- The original term-document matrix is presumed noisy: for example, anecdotal instances of terms are to be eliminated. From this point of view, the approximated matrix is interpreted as a de-noised matrix (a better matrix than the original).
- The original term-document matrix is presumed overly sparse relative to the "true" term-document matrix. That is, the original matrix lists only the words actually in each document, whereas we might be interested in all words related to each document, generally a much larger set due to synonymy.
The consequence of the rank lowering is that some dimensions are combined and depend on more than one term, for example {(car), (truck), (flower)} → {(1.3452 * car + 0.2828 * truck), (flower)}. This mitigates the problem of identifying synonymy, as the rank lowering is expected to merge the dimensions associated with terms that have similar meanings. It also mitigates the problem of polysemy, since components of polysemous words that point in the "right" direction are added to the components of words that share a similar meaning. Conversely, components that point in other directions tend either to simply cancel out or, at worst, to be smaller than the components in the directions corresponding to the intended sense.
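The rank lowering itself can be sketched with a truncated SVD; the small count matrix and the target rank below are made up purely for illustration:

    # Best rank-k approximation of a term-document matrix via truncated SVD.
    import numpy as np

    X = np.array([
        [2, 0, 1, 0],   # rows: terms
        [1, 1, 0, 0],
        [0, 2, 3, 1],
        [0, 0, 1, 2],   # columns: documents
    ], dtype=float)

    U, s, Vt = np.linalg.svd(X, full_matrices=False)

    k = 2
    X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k approximation, minimal Frobenius error

    print(np.linalg.matrix_rank(X_k))   # 2
    print(np.linalg.norm(X - X_k))      # size of the approximation error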
Derivation

Let $X$ be a matrix where element $(i,j)$ describes the occurrence of term $i$ in document $j$ (this can be, for example, the frequency). A row in this matrix is a vector corresponding to a term, giving its relation to each document:

$$\mathbf{t}_i^T = [x_{i,1}, \dots, x_{i,n}]$$

Likewise, a column in this matrix is a vector corresponding to a document, giving its relation to each term:

$$\mathbf{d}_j = [x_{1,j}, \dots, x_{m,j}]^T$$

Now the dot product $\mathbf{t}_i^T \mathbf{t}_p$ between two term vectors gives the correlation between the terms over the documents. The matrix product $X X^T$ contains all these dot products. Likewise, the matrix $X^T X$ contains the dot products between all the document vectors, giving their correlation over the terms.

From the theory of linear algebra, there exists a decomposition of $X$ such that $U$ and $V$ are orthogonal matrices and $\Sigma$ is a diagonal matrix; this is the singular value decomposition (SVD):

$$X = U \Sigma V^T$$

The matrix products giving us the term and document correlations then become

$$X X^T = U \Sigma \Sigma^T U^T, \qquad X^T X = V \Sigma^T \Sigma V^T.$$

Since $\Sigma \Sigma^T$ and $\Sigma^T \Sigma$ are diagonal, $U$ must contain the eigenvectors of $X X^T$, while $V$ must contain the eigenvectors of $X^T X$. Both products have the same non-zero eigenvalues, given by the non-zero entries of $\Sigma \Sigma^T$ (equivalently, of $\Sigma^T \Sigma$). The values $\sigma_1, \dots, \sigma_l$ on the diagonal of $\Sigma$ are called the singular values, and the columns $u_1, \dots, u_l$ and $v_1, \dots, v_l$ the left and right singular vectors. The only part of $U$ that contributes to $\mathbf{t}_i$ is the $i$-th row of $U$; call this row vector $\hat{\mathbf{t}}_i$. Likewise, the only part of $V^T$ that contributes to $\mathbf{d}_j$ is its $j$-th column, $\hat{\mathbf{d}}_j$.

It turns out that when the $k$ largest singular values, together with their corresponding singular vectors from $U$ and $V$, are selected, the result

$$X_k = U_k \Sigma_k V_k^T$$

is the rank-$k$ approximation to $X$ with the smallest error in the Frobenius norm. This approximation has a minimal error. But more importantly, we can now treat the term and document vectors as a "semantic space". The vector $\hat{\mathbf{t}}_i$ then has $k$ entries, each giving the occurrence of term $i$ in one of the $k$ concepts; likewise, the vector $\hat{\mathbf{d}}_j$ gives the relation between document $j$ and each concept. With this representation it is possible to:

- See how related documents $j$ and $q$ are in the low-dimensional space, by comparing the vectors $\Sigma_k \hat{\mathbf{d}}_j$ and $\Sigma_k \hat{\mathbf{d}}_q$ (typically with cosine similarity).
- Compare terms $i$ and $p$ by comparing the vectors $\Sigma_k \hat{\mathbf{t}}_i$ and $\Sigma_k \hat{\mathbf{t}}_p$.
- Given a query, view it as a mini document and compare it to the documents in the low-dimensional space.

To do the latter, you must first translate your query into the low-dimensional space. It is then intuitive that you must use the same transformation that you use on your documents:

$$\hat{\mathbf{d}}_j = \Sigma_k^{-1} U_k^T \mathbf{d}_j$$

The inverse of the diagonal matrix $\Sigma_k$ may be found by inverting each non-zero entry on its diagonal. This means that a query vector $\mathbf{q}$ is translated as $\hat{\mathbf{q}} = \Sigma_k^{-1} U_k^T \mathbf{q}$ before comparing it with the document vectors in the low dimensional space. You can do the same for pseudo term vectors:

$$\hat{\mathbf{t}}_i^T = \Sigma_k^{-1} V_k^T \mathbf{t}_i$$
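This query fold-in can be sketched directly in NumPy; the matrix, the rank $k$, and the query vector are illustrative, not part of the article:

    # Project documents and a query into the k-dimensional concept space and compare by cosine.
    import numpy as np

    X = np.array([[2, 0, 1, 0],
                  [1, 1, 0, 0],
                  [0, 2, 3, 1],
                  [0, 0, 1, 2]], dtype=float)   # terms x documents

    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    k = 2
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

    doc_hat = Vt_k.T                      # row j is d_hat_j, document j in concept space
    q = np.array([1.0, 0.0, 1.0, 0.0])    # query as raw term counts over the same vocabulary
    q_hat = (U_k.T @ q) / s_k             # q_hat = Sigma_k^{-1} U_k^T q

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    print([cosine(q_hat, d) for d in doc_hat])   # similarity of the query to each document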
Applications
Commercial applications

LSA has been used to assist in performing prior art searches for patents.[5]
Applications in human memory
The use of Latent Semantic Analysis has been prevalent in the study of human memory, especially in the areas of free recall and memory search. There is a positive correlation between the semantic similarity of two words (as measured by LSA) and the probability that the words will be recalled one after the other in free recall tasks using study lists of random common nouns. In these situations, the inter-response time between semantically similar words is also much shorter than between dissimilar words. These findings are referred to as the Semantic Proximity Effect.[6][7]
Implementation

The SVD is typically computed using large matrix methods (for example, Lanczos methods), but may also be computed incrementally and with greatly reduced resources via a neural network-like approach, which does not require the large, full-rank matrix to be held in memory.[9] A fast, incremental, low-memory, large-matrix SVD algorithm has also been described.[10]
Limitations

Some of LSA's drawbacks include:

- The resulting dimensions might be difficult to interpret. For instance, in {(car), (truck), (flower)} → {(1.3452 * car + 0.2828 * truck), (flower)}, the (1.3452 * car + 0.2828 * truck) component could be interpreted as "vehicle". However, it is very likely that cases close to {(car), (bottle), (flower)} → {(1.3452 * car + 0.2828 * bottle), (flower)} will occur. This leads to results which can be justified on the mathematical level but have no obvious meaning in natural language.
- LSA cannot capture polysemy (i.e., multiple meanings of a word). Each occurrence of a word is treated as having the same meaning because the word is represented as a single point in space. For example, the occurrence of "chair" in a document containing "The Chair of the Board" and in a separate document containing "the chair maker" are considered the same. This behavior results in the vector representation being an average of all the word's different meanings in the corpus, which can make comparison difficult. However, the effect is often lessened because words tend to have a predominant sense throughout a corpus (i.e. not all meanings are equally likely).
- Limitations of the bag of words model (BOW), where a text is represented as an unordered collection of words.
- The probabilistic model of LSA does not match observed data: LSA assumes that words and documents form a joint Gaussian model (ergodic hypothesis), while a Poisson distribution has been observed. Thus, a newer alternative is probabilistic latent semantic analysis, based on a multinomial model, which is reported to give better results than standard LSA.[11]
See also

- Compound term processing
- Explicit semantic analysis
- Latent semantic mapping
- Latent Semantic Structure Indexing
- Principal components analysis
- Probabilistic latent semantic analysis
- Spamdexing
- Topic model
  - Latent Dirichlet allocation
- Vectorial semantics
- Coh-Metrix
References

1. Susan T. Dumais (2005). "Latent Semantic Analysis". Annual Review of Information Science and Technology 38: 188. doi:10.1002/aris.1440380105.
2. "The Latent Semantic Indexing home page".
3. Markovsky I. (2012). Low Rank Approximation: Algorithms, Implementation, Applications. Springer. ISBN 978-1-4471-2226-5.[page needed]
4. Alain Lifchitz, Sandra Jhean-Larose, Guy Denhière (2009). "Effect of tuned parameters on an LSA multiple choice questions answering model" (PDF). Behavior Research Methods 41 (4): 1201-1209. doi:10.3758/BRM.41.4.1201. PMID 19897829.
5. Gerry J. Elman (October 2007). "Automated Patent Examination Support - A proposal". Biotechnology Law Report 26 (5): 435. doi:10.1089/blr.2007.9896.
6. Marc W. Howard and Michael J. Kahana (1999). Contextual Variability and Serial Position Effects in Free Recall (PDF).
7. Franklin M. Zaromb et al. (2006). Temporal Associations and Prior-List Intrusions in Free Recall (PDF).
8. Nelson, Douglas. "The University of South Florida Word Association, Rhyme and Word Fragment Norms". Retrieved 5/8/2011.
9. Genevieve Gorrell and Brandyn Webb (2005). "Generalized Hebbian Algorithm for Latent Semantic Analysis" (PDF). Interspeech'2005.
10. Matthew Brand (2006). "Fast Low-Rank Modifications of the Thin Singular Value Decomposition" (PDF). Linear Algebra and Its Applications 415: 20-30. doi:10.1016/j.laa.2005.07.021.
11. Thomas Hofmann (1999). "Probabilistic Latent Semantic Analysis" (PDF). Uncertainty in Artificial Intelligence.

- Thomas Landauer, Peter W. Foltz, & Darrell Laham (1998). "Introduction to Latent Semantic Analysis" (PDF). Discourse Processes 25 (2-3): 259-284. doi:10.1080/01638539809545028.
- Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Richard Harshman (1990). "Indexing by Latent Semantic Analysis" (PDF). Journal of the American Society for Information Science 41 (6): 391-407. doi:10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9. Original article where the model was first exposed.
- Michael Berry, Susan T. Dumais, Gavin W. O'Brien (1995). Using Linear Algebra for Intelligent Information Retrieval (PDF). Illustration of the application of LSA to document retrieval.
- "Latent Semantic Analysis". InfoVis.
- Fridolin Wild (November 23, 2005). "An Open Source LSA Package for R". CRAN. Retrieved 2006-11-20.
- Thomas Landauer, Susan T. Dumais. "A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge". Retrieved 2007-07-02.
External links

Articles on LSA

- Latent Semantic Analysis, a Scholarpedia article on LSA written by Tom Landauer, one of the creators of LSA.

Talks and Demonstrations

- LSA Overview, talk by Prof. Thomas Hofmann describing LSA, its applications in Information Retrieval, and its connections to probabilistic latent semantic analysis.
- Complete LSA sample code in C# for Windows. The demo code includes enumeration of text files, filtering stop words, stemming, making a document-term matrix and SVD.
Implementations

Due to its cross-domain applications in Information Retrieval, Natural Language Processing (NLP), Cognitive Science and Computational Linguistics, LSA has been implemented to support many different kinds of applications.

- Sense Clusters, an Information Retrieval-oriented Perl implementation of LSA
- S-Space Package, a Computational Linguistics and Cognitive Science-oriented Java implementation of LSA
- Semantic Vectors applies Random Projection, LSA, and Reflective Random Indexing to Lucene term-document matrices
- Infomap Project, an NLP-oriented C implementation of LSA (superseded by the semanticvectors project)
- Text to Matrix Generator, a MATLAB toolbox for generating term-document matrices from text collections, with support for LSA
- Gensim contains a fast, online Python implementation of LSA for matrices larger than RAM (see the sketch below)
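For readers who want to try one of these packages, here is a minimal sketch of the Gensim route; the toy corpus, the number of topics, and the variable names are illustrative, and only the basic LsiModel interface is shown:

    # Train an LSA (LSI) model with Gensim and fold a new document into the concept space.
    from gensim import corpora, models

    texts = [
        ["human", "interface", "computer"],
        ["survey", "user", "computer", "system", "response", "time"],
        ["graph", "trees", "minors", "survey"],
    ]

    dictionary = corpora.Dictionary(texts)                  # term ids
    corpus = [dictionary.doc2bow(text) for text in texts]   # bag-of-words vectors

    lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
    print(lsi.print_topics(2))            # the learned "concepts" as weighted term combinations

    query_bow = dictionary.doc2bow(["user", "computer"])
    print(lsi[query_bow])                 # the query's coordinates in the 2-dimensional space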