
www.newmediacenter.ru

Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, David M. Blei


30

2 74

, ,
,

5/28/2013


Word cloud with word counts: http://wordle.net/ (Jonathan Feinberg)


Information retrieval


Mapping Russian Twitter
March 20, 2012
By John Kelly, Vladimir Barash, Karina Alexanyan, Bruce Etling, Robert Faris, Urs Gasser, and John Palfrey


Vector Space Model

Term-document matrix. Documents: Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth.

Antony: 157, 73
Brutus: 157
Caesar: 232, 227
Calpurnia: 10
Cleopatra: 57
mercy
worser


Cosine similarity
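As an illustration of cosine similarity between document vectors, here is a minimal Python sketch. The function name `cosine_similarity` and the two toy count vectors are assumptions made for this example; they are not taken from the slides.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

# Hypothetical term counts for two short documents over a shared vocabulary.
doc1 = [3, 0, 1, 2]
doc2 = [1, 1, 0, 2]
print(round(cosine_similarity(doc1, doc2), 3))
```

Because the score depends only on the angle between the vectors, a document and a copy of it repeated twice get similarity 1, which is why the measure is commonly preferred to raw distance for texts of different length.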


J. Jayabharathy, Dr. S. Kanmani, and A. Ayeshaa Parveen. A Survey of Document Clustering Algorithms with Topic Discovery. Journal of Computing, Volume 3, Issue 2, Feb 2011.

Topic Discovery


K-means
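The K-means procedure can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: random seed selection, squared Euclidean distance, and the names `k_means`, `squared_distance` and `mean_vector` are choices made for this example, and the toy points are invented.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def k_means(vectors, k, iterations=100, seed=0):
    """Alternate reassignment and centroid recomputation until stable."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]  # random seeds
    assignments = None
    for _ in range(iterations):
        # Reassignment: each vector goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_assignments == assignments:
            break  # no vector changed cluster: converged
        assignments = new_assignments
        # Recomputation: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_vector(members)
    return centroids, assignments

# Hypothetical 2-D points forming two well-separated groups.
points = [[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
          [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]]
centroids, labels = k_means(points, k=2)
print(labels)
```

The result depends on the initial seeds in general, so in practice the algorithm is usually run several times with different seeds.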


Example: K = 2


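K-means: time complexity

Distance between two vectors: O(M), where M is the dimensionality of the vectors.
Reassignment step: O(KN) distance computations, that is O(KNM).
Centroid recomputation: O(NM).
With I iterations, overall: O(IKNM).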


K-means


Choosing K (1)

(K = 1)
(= K)


Choosing K (2)

RSS(K) (Residual Sum of Squares)

$K = \arg\min_K \left[ \mathrm{RSS}(K) + \lambda K \right]$
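A penalized criterion of this kind can be sketched as follows: each candidate K is scored by RSS(K) plus a penalty proportional to K, and the K with the smallest score is kept. The penalty weight `lam`, the function name `choose_k`, and the RSS values are hypothetical and serve only as an illustration.

```python
def choose_k(rss_by_k, lam):
    """Return the K that minimizes RSS(K) + lam * K."""
    return min(rss_by_k, key=lambda k: rss_by_k[k] + lam * k)

# Hypothetical RSS values, for illustration only (not from the slides).
rss_by_k = {1: 100.0, 2: 60.0, 3: 40.0, 4: 35.0, 5: 33.0}
print(choose_k(rss_by_k, lam=10.0))
```

With `lam = 0` the criterion reduces to plain RSS and always prefers the largest K available, which is why some penalty on the number of clusters is needed.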


Candidate values of K: 4 or 9.




animal
  vertebrate: fish, reptile, amphib., mammal
  invertebrate: worm, insect, crustacean

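Complexity of hierarchical clustering

First, similarities are computed for all N × N pairs of documents.
At each iteration, the O(N × N) similarities are scanned to find the pair to merge.
There are O(N) iterations, each with an O(N × N) scan.
Naive algorithm overall: O(N^3).
More efficient variants: O(N^2).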
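The cubic bound can be seen in a naive agglomerative procedure: there are O(N) merge steps, and each step scans all pairs to find the closest two clusters. The sketch below is an illustration only; the single-link criterion (distance between the closest members), the function name `naive_single_link`, and the toy data are assumptions, since the slides do not specify a linkage.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def naive_single_link(vectors, num_clusters):
    """Merge the two closest clusters until num_clusters remain."""
    clusters = [[i] for i in range(len(vectors))]  # start with singletons
    while len(clusters) > num_clusters:
        best = None  # (distance, index_a, index_b)
        # Scan every pair of clusters for the smallest single-link distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(
                    squared_distance(vectors[i], vectors[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])  # merge cluster b into cluster a
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical one-dimensional points.
data = [[0.0], [0.1], [5.0], [5.2], [10.0]]
print(naive_single_link(data, 2))
```

Stopping at different values of `num_clusters` corresponds to cutting the hierarchy at different levels.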


Evaluation against class labels: purity


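Purity of cluster $\omega_i$: the share of the dominant class in the cluster.

$\mathrm{purity}(\omega_i) = \frac{1}{n_i} \max_{j \in C} n_{ij}$

where $n_{ij}$ is the number of members of class $j$ in cluster $i$, and $n_i$ is the size of cluster $i$.

Other measures include the F-measure.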

Cluster 1: purity = 1/6 · max(5, 1, 0) = 5/6
Cluster 2: purity = 1/6 · max(1, 4, 1) = 4/6
Cluster 3: purity = 1/5 · max(2, 0, 3) = 3/5
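The same arithmetic can be checked with a short Python sketch using the class counts (5, 1, 0), (1, 4, 1) and (2, 0, 3) from the example. The function names are assumptions made for illustration, and `overall_purity` (dominant-class members summed over all clusters, divided by the total number of items) is the standard aggregate measure rather than something shown on the slide.

```python
from fractions import Fraction

def cluster_purity(class_counts):
    """Share of the dominant class within one cluster."""
    return Fraction(max(class_counts), sum(class_counts))

def overall_purity(clusters):
    """Dominant-class members summed over clusters, divided by all items."""
    total = sum(sum(counts) for counts in clusters)
    return Fraction(sum(max(counts) for counts in clusters), total)

# Per-class counts for the three clusters in the example.
clusters = [(5, 1, 0), (1, 4, 1), (2, 0, 3)]
for i, counts in enumerate(clusters, start=1):
    print(i, cluster_purity(counts))
print(overall_purity(clusters))
```

Exact fractions are used so that the results can be compared directly with the hand computation: 5/6, 4/6 (printed in lowest terms as 2/3) and 3/5, with an overall purity of 12/17.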


Latent Semantic Analysis (LSA) (Deerwester et al., 1990)


Latent Dirichlet Allocation (LDA)


10-30
200 400


API

Perl
Python: NLTK
Java


GATE (General Architecture for Text Engineering): http://gate.ac.uk/ (Java)
Mahout: http://mahout.apache.org/ (Java)
Stanford Topic Modeling Toolbox: http://nlp.stanford.edu/software/tmt/tmt-0.4/ (Java; LDA)
Mallet: http://mallet.cs.umass.edu/ (Java; Stanford TMT, Mahout, GATE)


GATE


Stanford Topic Modeling Toolkit


TMT on PubMed Data


Media Cloud (www.mediacloud.org)


Media Cloud Twitter vs LiveJournal


Russian Media Cloud

Russian Media Cloud (2)

Russian Media Cloud (3)

Russian Media Cloud (4)

Tools (1)

UIMA (Unstructured Information Management Architecture): http://uima.apache.org/ (XML, Eclipse, Java or C++)
NLTK (Natural Language Toolkit): http://www.nltk.org/ (Python; API)
LingPipe: http://alias-i.com/lingpipe/
RapidMiner: http://rapid-i.com/


Tools (2)

Carrot2: http://project.carrot2.org/
Weka: http://www.cs.waikato.ac.nz/ml/weka/
gCluto: http://glaros.dtc.umn.edu/gkhome/cluto/gcluto/overview
The Lemur Toolkit: http://www.lemurproject.org/
The Semantic Engine, The Semantic Vectors Package, Terrier IR Platform, etc.



HTML, XML









4GB

16 GB RAM





K-means, LSA, LDA
