You are on page 1of 301

Building Machine

Learning Systems
with Python
Second Edition
Get more from your data through creating
practical machine learning systems with
Python
Luis Pedro Coelho
Willi Richert

PACKT

PUBLISHING
Rll{MINGHAM MUMBI

::J?un;ee,;5:illdc


wr
Python


Python

, 2016

004.438Python:004.6
32.973.22

76

76

;1 ,
r 1ia Pytho11. 2-
/ 11. . . . - .: , 2016. - 302 .: .
ISBN 978-5-97060-330-7
1111 11
- ,
11 . Python -
11 .

.
1 , Pytho11

.

NumPy, SciPy, scikit-learn.
1,
, , ,
, .
n, Python
r 1
,
.

Original E11glisl1 l1gage editio11 p11blisl1ed Pt1lisl1e(I Packt Pulisl1ing Ltd.,


Livery Place, 35 Livc1")' Street, Bi11i11gliam 83 2. UK. Copyright 2015 Packt
Pulisl1ing. R11ssia11-la11gige cdition copyrigl1t () 2015 DM Press. AII rights reserve(\.
. 1, ii 11111 nn11n
11 111r
11 n n1 1111.
11, 111111 111ii , 11<110 11poucpe11. , 11,
1111111 1 n, n 1
11 1 II n11 111 n. tn:.111
TIIM 1,n 11 11111 II 1111. IIC
11 11111.

ISBN 978-1-78439-277-2 (.)


JSBN 978-5-97060-330-7 (.)

Copy1igl1t 2015 Packt Publishing


, pycci1ii ,
, 2016


...................................................... 11
.................................................. 13
................................................... 15
...................................................................... 15
.......................................... 17
........................................................ 17
................................................................ 17
.......................................................................................... 18
..................................................................... 18
....................................................................... 19
............................................................................................... 19
................................................................... 19

........................................................................................ 20

1.
Python ................................................ 21
Python ............................. 22
( ) ........................... 23
, ........................................................ 25
....................................................................... 26
NumPy, SciPy matplotlib .................................................... 26
Python .................................................................................. 27
NumPy SciPy
.................................................... 27
NumPy..................................................................................... 27
SciPy ....................................................................................... 32

() ........33
...................................................................................... 33
..................................... 35
........................... 36

......................................................................................... 46

2. ............ 47
Iris........................................................................... 48
....................................................... 48
......................................... 50

............................................................................................... 53

............................... 57
.... 58
Seeds ............................................................................. 58
......................................................... 59
................................................ 60

scikit-learn .......................................... 61
.............................................................................. 62

................................. 65
......................................................................................... 66

3.
...................................................... 68
................................................... 69
................................................................................ 69
.................................................................................... 70


......................................................................... 71
..................................... 71
- ............................................................. 80
................................................... 81

.............................................................................. 82
K- .................................................................................. 83
.......................................... 85
................................................................... 87

............................................................. 88
........................................................................... 90
......................................................................... 92

......................................................................................... 92

4. ................. 93
................................................... 93
................................................ 95
................................................. 100
......................................................... 103

......................................................................... 106

....................................................................................... 107

5.
......................................................... 109
............................................................................. 109
.............................. 110
............................................................................ 110
................................................................. 110

....................................................................... 111
............................................................... 112
.............................. 112
? ........................................................... 114

............................................. 115
k ............................................................... 115
........................................................................ 116
.................................................................. 117
................................................. 117
...................................... 118

? ................................................................... 121
- .......................................................... 122
......................................................... 122
.......................................................... 123
? ......................................................... 123

............................................................ 125
.......................................... 126
........ 128

................................. 129
........................................................ 133
! ......................................................................... 134
....................................................................................... 135

6. II
......................................................... 136
............................................................................. 136
.......................................................... 137
....................... 137
............................................................................... 138
.................................................................... 139

............................................................................. 140
......................... 143
...................................... 144

....................................... 147
.......................................................... 147
............................................................... 150
............................................. 153

............................................................................ 157
........................................................................... 159
..................................................................... 159
SentiWordNet ............................................ 162
.......................................................................... 164
........................................................................... 166

....................................................................................... 167

7. .......................................... 168
......... 168
..................................................................... 172
............................................... 173

, ............ 174
L1 L2................................................................................... 175
Lasso scikit-learn ................................ 176
Lasso .................................................................. 177
P--N ........................................................................ 178
, ................................... 179
.............................. 181

....................................................................................... 185

8. .................................... 186
............................................. 186
.................................... 188
.......................................................... 189
................................. 191
......................................... 195
................................................ 196

........................................................................... 199
.......................................................... 200
.............................................. 201
.............................................................. 204
.......................................................... 206

....................................................................................... 207

9.
......................................................... 208
............................................................................. 208
......................................................... 209
WAV ........................................................... 209

......................................................................... 210
............................................... 211

....... 213
.................................................... 213
.................................................................. 215

.................................................................. 215

................................. 218

-
..................................................... 220
....................................................................................... 225

10. ............................. 227


............................................ 227
............................................................ 228
....................................................................................... 230
.......................................................................... 231
................................................................. 233
................................................ 235
............................................. 236
...................................................... 237
............. 239
............................... 240

......................................... 242
....................................................................................... 246

11. .................... 248


............................................................................. 249
......................................................................... 249
.................... 250
........ 257
....................................................... 259

................................................................. 260
........................................................... 260
PCA LDA ......................................... 263

....................................................... 264
....................................................................................... 267

10. ......................... 269


.......................................................... 269
jug .......................... 270
jug ......................................................................... 271

10

............................................................................. 273
jug ................................................... 275
.............................. 278

Amazon Web Services ..................................................... 279


.......................................................... 281
Python- Amazon Linux.......................................... 285
jug .......................................................... 286
StarCluster ..... 287

....................................................................................... 291


..................................... 293
...................................................................... 293
.......................................................................................... 293
- .......................................................... 294
.......................................................................................... 294
....................................................................... 295
.................................................................... 295
................................................................... 295
....................................................................................... 296

.................................... 297


(Luis Pedro Coelho) , ,
.
, .
. .
,
. .
1998 ,
, . 2004
Python
. Python mahotas,
.
, .

, .
(Willi Richert) .
,
. Microsoft
Bing, , ,
.

12


, .
(Andreas Bode), (Clemens
Marschner), (Hongyan Zhou)
(Eric Crestan),
(Tomasz Marciniak), (Cristian Eigel), (Oliver Niehoerster) (Philipp
Adelt). , , ,
.



(Ecole Suprieure d'Electricit) ( , ),
, .
.
HT Python 2003 .
. , ,
, . Computational and Mathematical Biology
The Python Papers.
, AdvanceSyn
Pte. Ltd., .
,
.
, ,
. http://maurice.
vodien.com, LinkedIn http://www.linkedin.
com/in/mauriceling.

. Seznam.cz, .
,
, -

14

RaRe
Consulting Ltd.

:
.
, gensim smart_open.
, . ,


.


- , ( ) . , , .
. - ,
- . , , , , .
, . ,
? ?
.
?
?
, .
. , .


1 Python

. , , .
2 ,
.
3
,
, .

16

4 ,
, .
5 ,
- , ,
.
6 II
, , .
7 , ,

. , Lasso .
8 . , , , (
).
9 , -
, . ,
, .
10 , .
,
.
11 , ,
.
12 , ,
.
(
Amazon Web Services).

17


, .

, Python
easy_install pip. .
, :
Python 2.7 ( 3.3 3.4);
NumPy 1.8.1;
SciPy 0.13;
scikit-learn 0.14.0.


, Python .
, .
, Python .
Python ,
C C++ . ,
.


. .
, , , URL-, , ,

18

: poly1d()
.
:
[aws info]
AWS_ACCESS_KEY_ID = AAKIIT7HHF6IUSN3OCAA
AWS_SECRET_ACCESS_KEY = < >

:
>>> import numpy
>>> numpy.version.full_version
1.8.1


. ,
:
Change instance type.
.

. ,
, , . , , .
,
feedback@packtpub.com, .

, www.packtpub.com/authors.


Packt ,
.

19


Packt,
http://www.packtpub.com. , http://www.packtpub.com/
support, ,
.
GitHub
https://github.com/luispedro/BuildingMachineLearningSystemsWithPython. ,


Python .

, -
. , , , .
.
http://www.packtpub.com/support,
, Errata Submission Form . , Errata .
,
http://www.packtpub.com/books/content/
support. Errata.
www.TwoToReal.com,
.



. Packt .
, , -,
.

20


copyright@packtpub.com.


.

- ,
questions@packtpub.com,
.

1.

Python
, . . , .
, ,
. ,
, . :
?
, ,
, . , .
, : ?
? , ?
?
! ()
, ,
. , , .
,
. , , ,
. , ,
,
.

22

1. Python

Python

(, )
, (
). , , , :
, . -
, .

,
. .
, . -
, ( , ) .
,
, .
.
, ( ) .
,
, . , , .
(
www.kaggle.com, ?), , .
, . , , .
Python . Python,
, . , . , C
.
, C, .

( )

23


( )
, , , .
, , ,
, ,
, . :
;
;
, ;
;
.
. , , , , - ,
.
. ,
, , , . ,
, , , .
, , - .
? ,
? , , ? , .
,
,
.
. ,
,
.

24

1. Python

, (feature engineering),
,
.
.
, ,
-
( , - ).

.
, ?
,
?

?
, ,
.
, , ,
. , ,
. ,
.
,
. ,
,
. , , ,
, .

,
(
),
, .

.
. ,

.
NumPy
SciPy Python
scikit-learn.
,
.

25

,
.

,
, ,
. ,
, . : , .
. , - , .
http://metaoptimize.com/qa: . .
,
.
http://stats.stackexchange.com:
Cross Validated, MetaOptimize, .
http://stackoverflow.com: , . ,
, ,
SciPy matplotlib.
#machinelearning https://freenode.net/:
IRC-, .
,
.
http://www.TwoToReal.com:
,
, . , , ,
.
, .

26

1. Python

, , . , , .
,
( ) http://blog.kaggle.
com, Kaggle, .
, , ,
. , .


, Python ( ,
2.7), NumPy SciPy
matplotlib .

NumPy, SciPy matplotlib


, , ,
. , , .
,
. , .
, Python - (
), C FORTRAN.

Python, , ?
, , Python, , C
FORTRAN. NumPy SciPy
(http://scipy.org/Download). NumPy . SciPy
. , matplotlib (http://matplotlib.org/) ,
Python.

27

Python
, ,
Windows, Mac Linux, NumPy,
SciPy matplotlib. ,
, Anaconda Python ( https://store.continuum.io/cshop/anaconda/),
(Travis Oliphant),
SciPy. , Enthought Canopy (https://www.enthought.com/downloads/)
Python(x,y)
(http://code.google.com/p/pythonxy/wiki/Downloads),
Anaconda Python 3
Python, .

NumPy
SciPy


NumPy, , SciPy .

Matplotlib.

NumPy,
http://www.scipy.org/Tentative_NumPy_Tutorial.
Ivan Idris NumPy Beginner's Guide
( ), Packt Publishing. http://scipylectures.github.com, SciPy
http://docs.scipy.org/doc/scipy/reference/tutorial.
NumPy 1.8.1 SciPy 0.14.0.

NumPy
, NumPy .
Python:
>>> import numpy
>>> numpy.version.full_version
1.8.1

28

1. Python

, ,
, :
>>> from numpy import *

, , numpy.array
Python.
:

array

>>> import numpy as np


>>> a = np.array([0,1,2,3,4,5])
>>> a
array([0, 1, 2, 3, 4, 5])
>>> a.ndim
1
>>> a.shape
(6,)

, Python.
NumPy
. . .
:
>>> b = a.reshape((3,2))
>>> b
array([[0, 1],
[2, 3],
[4, 5]])
>>> b.ndim
2
>>> b.shape
(3, 2)

, NumPy. , :
>>> b[1][0] = 77
>>> b
array([[ 0, 1],
[77, 3],
[ 4, 5]])
>>> a
array([ 0, 1, 77, 3, 4, 5])

b 2 77 , , a.
, :

29

>>> c = a.reshape((3,2)).copy()
>>> c
array([[ 0, 1],
[77, 3],
[ 4, 5]])
>>> c[0][0] = -99
>>> a
array([ 0, 1, 77, 3, 4, 5])
>>> c
array([[-99, 1],
[ 77, 3],
[ 4, 5]])

c a .
NumPy
. ,
NumPy ,
:
>>> d = np.array([1,2,3,4,5])
>>> d*2
array([ 2, 4, 6, 8, 10])

:
>>> d**2
array([ 1, 4, 9, 16, 25])

Python:
>>> [1,2,3,4,5]*2
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
>>> [1,2,3,4,5]**2
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int'

, NumPy,
Python. ,
, NumPy .
, , , , .

NumPy . ,
:

30

1. Python

>>> a[np.array([2,3,4])]
array([77, 3, 4])

,
,
:
>>> a>4
array([False, False, True, False, False, True], dtype=bool)
>>> a[a>4]
array([77, 5])

:
>>> a[a>4] = 4
>>> a
array([0, 1, 4, 3, 4, 4])

, clip, :
>>> a.clip(0,4)
array([0, 1, 4, 3, 4, 4])


NumPy ,
, . , -
numpy.NAN:
>>> c = np.array([1, 2, np.NAN, 3, 4]) # ,
#
>>> c
array([ 1., 2., nan, 3., 4.])
>>> np.isnan(c)
array([False, False, True, False, False], dtype=bool)
>>> c[~np.isnan(c)]
array([ 1., 2., 3., 4.])
>>> np.mean(c[~np.isnan(c)])
2.5


NumPy
Python.
1 1000 . 10 000 .

31

import timeit
normal_py_sec = timeit.timeit('sum(x*x for x in range(1000))',
number=10000)
naive_np_sec = timeit.timeit(
'sum(na*na)',
setup="import numpy as np; na=np.arange(1000)",
number=10000)
good_np_sec = timeit.timeit(
'na.dot(na)',
setup="import numpy as np; na=np.arange(1000)",
number=10000)
print("Normal Python: %f sec" % normal_py_sec)
print("Naive NumPy: %f sec" % naive_np_sec)
print("Good NumPy: %f sec" % good_np_sec)
Normal Python: 1.050749 sec
Naive NumPy: 3.962259 sec
Good NumPy: 0.040481 sec

. -, NumPy (Naive NumPy),


3,5 . ,
, C
. ,
Python .
, ,
.
: NumPy
dot(), , 25-
. ,
, Python
- , NumPy SciPy.
.
NumPy,
Python, .
NumPy .
>>> a = np.array([1,2,3])
>>> a.dtype
dtype('int64')

,
, NumPy
, , :

32

1. Python

>>> np.array([1, "stringy"])


array(['1', 'stringy'], dtype='<U7')
>>> np.array([1, "stringy", set([1,2,3])])
array([1, stringy, {1, 2, 3}], dtype=object)

SciPy
NumPy SciPy
, .
, SciPy. ,
, , ,
. , , scipy.
NumPy
SciPy. , ,
NumPy SciPy. ,
:
>>> import scipy, numpy
>>> scipy.version.full_version
0.14.0
>>> scipy.dot is numpy.dot
True

.
SciPy packages

Functionalities

cluster

(cluster.hierarchy)
/ k- (cluster.vq)

constants

fftpack

integrate

interpolate

(, . .)

io

linalg


BLAS LAPACK

ndimage

() ...

SciPy packages

Functionalities

odr

optimize

( )

signal

sparse

spatial

special

, ,

stats

33

scipy.stats,
scipy.signal.
stats, , .

scipy.interpolate, scipy.cluster

()

, MLaaS,
. ,
. ,
. , , - . ,
,
100 000 . ,
,
, .


,
ch01/data/web_traffic.tsv ( tsv , ). .
( ) .

34

1. Python

genfromtxt() SciPy :
>>> import scipy as sp
>>> data = sp.genfromtxt("web_traffic.tsv", delimiter="\t")

, ,
.
, :
>>> print(data[:10])
[[ 1.00000000e+00 2.27200000e+03]
[ 2.00000000e+00 nan]
[ 3.00000000e+00 1.38600000e+03]
[ 4.00000000e+00 1.36500000e+03]
[ 5.00000000e+00 1.48800000e+03]
[ 6.00000000e+00 1.33700000e+03]
[ 7.00000000e+00 1.88300000e+03]
[ 8.00000000e+00 2.28300000e+03]
[ 9.00000000e+00 1.33500000e+03]
[ 1.00000000e+01 1.02500000e+03]]
>>> print(data.shape)
(743, 2)

, , 743
.

Howe (nr} n ...

........

SciPy
11 743 . , , , ,
, - .
SciPy
, :
= data[:,OJ
= data[:,1]

SciPy .
, 111111
. http://www.scipy.org/Tentative_
NumPy_Tutorial.

l\l , 111 11 - nan. HI\IJJ ? ,


:
>>> sp.sum(sp.isnan(y))

, 8 743 ii,
. , SciPy 1\!
. sp.isnan(>
, , . ,
, :
>>> = x[-sp.isnan(y)]
>>> = y[-sp.isnan(y)]

,
,
atplotlib. pyplot,
MATLAB - :
>>> import matplotlib.pyplot as plt
>>># (,) 10
>>> plt.scatter(x, , s=lO)
>>> plt.title("Web traffic over the last month")
>>> plt.xlabel("Time")
>>> plt.ylabel("Hits/hour")
>>> plt.xticks([w*7*24 for w i range(lO)J,
['week %i' % w for w in range(lO)])
>>> plt.autoscale(tight=True)

........

1. Python

>>> #
>>> plt.grid(True, linestyle = '-', color = '0.75')
> plt. show ()

http://matplotlib.org/users/pyplot_tutorial.
html.

,
- ,
.
Web traffic over the last month

:...,'.

.;

:-: 4'

:
i'
2

LOGO

:::,

?}i;}/lYiH{1)ij:!:;;/
.

;r:,:>t:-.

,ek

, :
?
:
1. , .
2.
,
.

() ...

11111 ..

...

, -
.
, .

.

; f - , :
def error(f, , ):
return sp.sum((f(x)-y)**2)


.
SciPy. ,
,
.

,
. - ,
.
polyfit 1) SciPy. ,
( 1 ),
,
:
fpl, residuals, rank, sv, rcond = sp. polyfit( , , 1, full=True)

polyfit 1)
fpl. full=True,
. ,
:
>>> rint(" : 's" % fpl)
: [ 2. 59619213 989.02487106]
>>> print(residuals)
[ .17389767+08]

,
:
f(x) = 2.59619213 + 989.02487106.

polylct
110 :

........

1. Python

>>> fl = sp.polyld(fpl)
>>> print(error(fl, , ))
317389767.34

full=True,
. ,
.
.
:
fx = sp.linspace(O,x(-1), 1000) #
plt.plot(fx, fl(fx), linewidth =4)
plt.legend(["d=%i" % fl.order), loc= "upper left"I
:
Web traffic over the last month
6000

d=l

5000

.
....:

4000

:,

...1

3000

:f

2000

1000

v;erk l

wek. 3

,
. :
317 389 767.34 - ?

. ,
, .
,
. ,
.

Hawe () ...

111111

, 2.
, .
> f2p = sp.polyfit\x, , 2)
> print(f2p)
array([ 1.05322215-02, -5.26545650+, 1.97476082+])
>>> f2 = sp.polyld(f2p)
>>> print(error(f2, , ))
179983507.878

.
Web traffic over the last month

6000

5000

4000

,_;:
'

' ./,

,::


1000

1000

wetk

week 1

wek 2
1111

179 983 507.878,


. , :
, , polyfit <)
.
:
f(x) = .0105322215 * **2 - 5.26545650 + 1974.76082

,
? 3, 10 100.
, 100

........

n 1. Python

RankWarning: Polyfit may poorly conditioned

, - polyfit
100 ,
53 .
Web traffic over the last month
6000

d=l

soco

4000

3000

lOCO

1000

week

week 1

\\leek 2
Tin1e

week

wel!'k4

, ,
.
:
Error
Error
Error
Error
Error

d=l: 317,389,767.339778
d=2: 179,983,507.878179
d=3: 139,350,144.031725
d=lO: 121,942,326.363461
d=53: 109,318,004.475556

,
,
, . ,
,
? 10 53
. ,
. ,
, .
.

Hawe () ...

.........

, 111
:
- m
;
, ;
.
m1 ,
10 53 .
- .
, 1
.
pyroii ii - -
.
? ,
.

, -

, II .
.
,
3,5:
inflection = .5*7*24
x[:inflection]

y[:inflection]

x[inflection:]

y[inflection:J

fa
fb

#
#
#

sp.polyld(sp.polyfit (, , 1))
sp.polyld(sp.polyfit(xb, , 1))

fa_error = error(fa, , )
fb_error = error(fb, , )
print("Error inflection= %f" % (fa_error + fb_error))
Error inflection= 132950348.197616

, .
, 1
, .
, ,
. ?

........

1. Python
Web traffic over the last month

6000

5000

. .
-i
.::.-

4000

2000

1000

:/:::;; :/iJI:.;y*'f/{_
:".

11?

we. 3

, ,
, ,
? ,
. , ,
(d=1 ).
Web traffic over the last month

10000

d-1
d=2
d=
d-10
d=S

8000

6000

2000

week

,ek 1

wek 2

week 4

Howe () ...

.....
,

10 53, ,
.
,
. . ,

. .
2
, ,
. ,
,
. ,
,
.
Web traffic over the last month

10000

d=l
d=2
d=
d=lO
d=S

8000

6000

4000

lCOO

wek

'llt't' 3

,
,
(k,
):
Error
Error
Error
Error
Error

d=l:
d=2:
d=3:
d=lO:
d=53:

22,143,941.107618
19,768,846.989176
19,766,452.361027
18,949,339.348539
18,300,702.038119

1111111111

1. Python

- ,
, ,

.
,
, .
, -
.
.
,
.
, , :
Error
Error
Error
Error
Error

d=l: 6397694.386394
d=2: 6010775.401243
d=3: 6047678.658525
d=lO: 7037551.009519
d=53: 7052400.001761

:
lOQCO

Web trafflc over the last month


d=l
d=2
d=
d=lO
d=S

0000

'

'

2000

.-:,:>,:'.:\
.01

, , :
2 ,

() ...

111111

, .
, .

, , ,
; ,
100 .
'I, 100 .
2
.'l 100 . , ,
.
100
11111. SciPy optimize fsolve,
, ,
.
, 743 ,
- , .
fbt2 - - 2.
>>> ft2 = sp.poly ld(sp.polyfit (xb[train], yb[train], 2))
>>> print("fbt2(x)= \nis" % fbt2)
ft2(x)=
2

0.086 - 94.02 + 2.744+4


>>> print("fbt2(x)-100,000= \n%s" % (fbt2-100000))
fbt2(x)-100,000=
2

0.086 - 94.02 - 7.256+4


>>> from scipy.optimize import fsolve
>>> reached_max = fsolve(ft2-100000, 0=800)/(7*24)
> print("l00,000 hits/hour expected at week U" % reached_max[O])

, 100 /
9.616071, ,
il
,
.
, .

,
.
,
. ii

1. Python

. ,
. II
1 l\1, ,
.

! ,
-
, ,
,
. ,
. ,
1 II .
, -
. ,
,
- .
scikit-learn, l\l
,
,
.

rllABA2.
n
n
- . ,
u,
. 11
. ,
.


.
ii : , 11

.
,
,
. -
- .,
.
1<, ,
II ,
. ,
. : <<
, ?,>.
,
.

. :\t
, ,
scikit-learn. -
,
.

........

2.

lris

Iris - , 1930- ;
11
.

. ,
. 111-1111
111 , 1930- 11
.
i\:
;
;
;
.
l, , 11,
, .
1
.
l\1 .
. l i\ : <<
, .1111,
11?>->.
1 ,
1111: 1111, ,
1, 11.
, , :
, 11
, 1, -
- , r,
.
( )
11 . Iris
. 1,11 (150 , ),
11:111 1.

il . ,
. 11

........

lris

.
MqI ,
,
. , -
,

.

. () Iris Setosa, Iris Versicolor plants
() Iris Virginica (). ,
: Iris Setosa Iris Versicolor Iris Virginica.

.t::

"
3
.

"'
CIJ

.!

QJ
.

QJ
.

.
i5
3

....

sepal length (cm)

.
,

.t::
,

.t::

i5
3
..,.

... .....

..

sepal width (cm)

sepal length (cm)

CIJ

,.,.,;! .w..i..

111

QJ
.

f.

i,,,...,,1

....

-. .

.,t,..::r.....

sepal width (cm)

.."..t...:-.... .. ::
.....

sepal length (cm)

"
3
10
.t::

CIJ

...
petal tength (cm)

:
>>> from matplotlib import pyplot as plt
>>> import numpy as np
>>> load_iris sklearn
>>> from sklearn.datasets import load iris
>>> data = load_iris()
>>> # load iris

........
>>>
>>>
>>>
>>>

2.

features = data.data
feature names = data.feature names
target = data.target
target_names = data.target_names

>>> for t in range():


if t == :
= 'r 1
marker = '>'
elif t == 1:
= 'g'
marker = ''
elif t == 2:
= ''
marker = ''
plt.scatter(features[target
features[target
marker=marker,
=)

t, ],
t, 1],

- ,
, .
, Iis Setosa
. , :
>>># NumPy,

>>> labels = target_names[target]
>>># - 2
>>> plength = features[:, 2)
>>>#
>>> is_setosa = (labels == 'setosa')
>>>#
>>> max_setosa = plength[is_setosa].max()
>>> min_non_setosa = plength[-is_setosa] .min()
>>> print('Maximum of setosa: {).' .format(max setosa))
Maximum of setosa: 1.9.
>>> print('Minimum of others: {).'. format(min_non_setosa))
Minimum of others: 3.0.

, :
2, Iris Setosa, Iris Virginica, Iris
Versicolor.

lris

........

lis Sctosa , .
,
, .
1 111 , ,
.
Iris Setosa .
, Iris Virginica
Iris Versicolor, . , ,
.
: ,
. -
.
,
Setosa:
>>>
>>>
>>>
>>>
>>>

# - -
features = features[-is_setosa]
labels = labels\-is_setosa]
i is_virginica
is_virginica = (labels == 'virginica')

N.
is_setosa - , 1,
, features labels.
, ,
.
,
, . .
>>> # .rD,! est_acc ,
>>> best = -1.0
>>> for in range(features .shape[l]):
#
thresh = features[:,]
for t in thresh:
i 'fi'
feature_i = features[:, ]
# 't'
pred = (feature_i> t)
= (pred == is_virginica).mean()
rev_acc = (pred == -is_virgiica) .mean()
if rev > :
reverse = True
rev
else:

........

2.

reverse = False

if > best :
best

best fi = fi
best t = t
best reverse = reverse

II
: <, 11
.
rev_acc, .
.
pred , is_virginica.

, .
, best_fi, best_t
est_reverse .
,
. :
def is_virginica_test(fi, t, reverse, example):
"l threshold model to new example"
test = example[fi] > t
if reverse:
test = not test
return test

?
, , ,
. ,
, .
, ,
- , .
: II . , ,
I1is Virginica, -
Iris Versicolor.
,
.
.
( ),
.
.

111111111

lris

,:

;;;

Q,

,:

:::

:<

><
1.0

,:

;;;,,, ,

'

..-;

,
;.(



: !

1.5

2.0

2,5

petal v11dth (cm)

-

,
94 % .
.
,
. ,
, .
.

.
, 1'/ .
,
. :
, . ,
, .
:
Training accuracy was 96.0%.
Testing accuracy was 90.0% (N

50).

........

2.

, (
) , .
, .
, ,
, .
, , ,
. , , ,
, 1111 - ,
, . ,
,
.

.
, ,
.

1111
.
. ,
,
- , !
, , 11
. ,
. , 11
,
.
, , .
1111 11:-.1
.
. -
, ,
, ,
.
.
.1 :
>>> correct = .
>>> for ei in range(len(features)):
# , 'ei':
training = np.ones(len(features), bool)
training[ei] = False

11111 ..

lris

testing = -training
model = fit_model(features[training], is _ virginica[training])
predictions = predict(model, features[testing])
correct += np.sum(predictions == is_virginica[testing])
>>> = correct/float(len(features))
>>> print('Bepoc: {0:.1%}'.format(acc))
: 87.0%


, ero .
r ,
,
. 1
r.
.

, 1
. ,
, .

- ,
r - . ,
5 .
, 111
.
ii , 110 20 ,
. ,
, .

,
, ...,
'

2.

:
.
.
. ,
( -
),
. - .
80 ,
, 11
. , 110 10
20 . , ,
,
. ,
, ,
2 3 .
. ,
- ,
.
, r
1 scikit-learn .
, r , ,, i'.
-
? -
. ,
.

.

.

,
,
,
. n ,
II 11
. r11
,
.

n n n

11111 ..

,:
. ? !
. , 110
.
: << 1111?1>. .
. 1111i\ ?
,
.
r1 .
. ?
1111
. ,
II
11:v1 11
.
011 ( , scikit-learn ,
, , ttii
).
. ,
?
, ,
, .
, ,

. ,
, .
,
, ,
.
( )
( ).
, r , .
,
1. -

........

2.

,
Ii.
:-.1 i-1111111,
.
, 1111
. II .
. .
11111 111 II
.
( ,
) , l
. (
, )
.
,
( ,
, ,
). , n {
. , -11, ,
1 ,
, 111111 .
( )
. ,
,
. ,
, ,
.



1, 111,1ii .
111
- .

Seeds
,, 11 ,
, ,,

...

111111

: ii , Iis.
11 .
:
;
;
= 4nA/P i;
;
;
;
.
, , :
Canadian, II Rosa. , -
.
Iris, 1930- ,
,
.

: , ,

. 10 ,

. ,
.
1 UCI
(UCI)
(
223 ). lris Seeds
. http://archive.ics.
uci.edu/ml/.

: ,
, .
.
(featurc
egineering). , ,

........

2.

(
, ).
, -
r. 11r
.
, ,, .
(
, - ).
,
( ),
, . ,
, .
, ..

. , !\11
II , ,
. ,
,
. ,
, , (
).
( , ). ,,
; - ,
.
, ,
. ,
.
-
. .
,
. '!,
, ,
, , ,
.


- . ii 1111. , ,
r 11 1111 11ii111'i .

scikit-learn

........

. ,
11 !
ii ,
r1 ( ,
, ,
).

.
11r - ,
.

.


scikit-learn
, Python -
1ii ,
. , scikit-lea.-n
,
.
.
API scikit-lea
. :
fit (features, labels):
;
predict ( features): fit
1111 .
1< k
.
KneighborsClassifier sklearn.neighbors:
>>> from sklearn.neighbors import KNeighborsClassifier
scikit-learn sklea.-n (
scikit-learn ).
sklearn - , sklearn.
neighbors.
.
:

........
>>> classifier

2.

KNeighborsClassifier(n_neighbors= l)

,
5, .
(
?). scikit-learn :
>>>
>>>
>>>
>>>
>>>

from sklearn.cross_validation import KFold


kf = KFold(len(features), n_folds=5, shuffle=True)
# 'means ' - (no n}
means = []
for training,testing in kf:
# ,
# 'predict':
classifier. fit(features [ training], labels[training]}
prediction = classifier.predict(features[testing])

# np.mean, ,
#
curmean = np.mean(prediction == labels[testing]}
means.append(curmean}
>>> rit(" : { :.1%)".format(np.mean(means}))
: 90.5%


90.5 %.
, ,
,
.

. ,
, !'. , .
:
Canadian , , Rosa - .
, . ,
. ,
() 10 22, ()
0,75 1,0. ,
.
.
,
, 11 - .

11 scikit-leorn
1.00
0.95
.., 0.90

0.85
0.80
0.75
10

12

14

16

18

20

22

( ), , ,
, ,
, (
).
.
-;
z-. Z- ,
,
. :
!'-

/-
(1

f - , /' -
, - , u -
. u .
,
z- ,
- ,
- .

........

2.

scikit-learn
.
: ero , -
.
:
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.preprocessing import StandardScaler
:
>>> classifier = KNeighborsClassifier(n_neighbors=l)
>>> classifier = Pipeline([('norm', StandardScaler()),
('knn', classifier)))
Pipeline (str, clf).
; , , - , .

.

( , ,
),
.
, 93%
!

-1.0

-0.5

0.5

1.0

1.5

2.0

........

110-1. 11 1111,
.
111 ,
, :
1111, .

ii
. ,
, .
- -
, .
,
, 11 .

ii. Iris,
: , ,
,
:
1. Iris Setosa ( )?
2. , , Iris Vigiica ( 11 ).
,
. ,
ii .
-,
11 .
- ?
,
. , ,
, , 1111
.
1111 11111111.

, : << -
?. 11 ,
. -

........

2.

Iris. -
. , r
r . , , r
,
.

r
r.
. scikit-learn
sklearn.multiclass.
,
.

,
.
, , , ,
,
.


( ,
, ).
'.
ii 1<11111.

........

- ,
.
Iis. ,
,
, .
.
111- .
, -
, 1
. ,
.
,
,
( ,).
1 .
, 111,1
-
. ,
,
. ,
,
, .
, ,
.

r3.
1111 - n
11
6
, ,
. ,
, <1
. 11
, <11
.
, ,
, ,
. ,
- ,,
? ?
, .
-
. ' .
:-.1 ,
- .
- , , ,
.
, , <
, ,
.
- ,
II
.
. ,
:v .

........

}I 1 111.
, r , - . ,
,, - ,
m .
, 1\ ,
, .
,
. 1
Sciit, 11 11 r
, .
"

<<>
. ,
,
1111. II
, , .

,
. : <<hi,> 11 <<hi.

, .
: <ia <1111
<,,>. ii ,
.
, :

. (
): <<How to 1t had disk,> ( <
) <iHard disk t"orat s ( <
,>).
5, <<how, <<tO>>,
<,forat, <, <,ft,> <,s.
,
, ,
, {1. -

IIZ:III

. - ...

110 ,
11.
, ,
. format,>
2 , ,
. ,
.

:
. 1111
, ,1 .

, . ,
.
, ,
. , ,
;\ :
n

1
'

'

disk
format

how
hard

my
proems
to

<< 1,> <,


2 :1 .

, (,
, ,). , ,
,
:
1. 11111 11 t
n n .

- ...

1111 IJI

2. 1111 .
3. , -.
4. - ,,
11 -.
1111.
ii -
. .

, II
, . 111 .
.

cuoii II
. countVectorizer
SciKit , II
. Sciit 1111
sklearn:
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vectorizer = CountVectorizer(min_df = l)

min_df ( ) ,
countVectorizer
. , 11
, . 1111 - ,
,
. max_df .
, 1, , SciKit
,,111 :
>>> print(vectorizer)
CountVectorizer(analyzer = 'word', binary = False, charset = None,
charset_error = None, decode_error = 'strict',
dtype = <class 'numpy.int64'>, encoding = 'utf-8',

........ 3. 11 - 11...
input = 'content',
lowercase = True, max_df= l.O, max_features = None, min_df = l,
ngram_range = (l, 1), preprocessor = None, stop_words = None,
strip_accents = None, token_pattern = '(?u)\\b\\w\\w+\\b',
tokenizer= None, vocabulary = None)

(analyzer = worct), ,
token_pattern. , cross-validated!>

n : <,cross,> validated,>.

, :
>>> content
proems "]

["How to format my hard disk", " Hard disk format

fit_transform(),

>>> = vectorizer.fit_transform(content)
>>> vectorizer.get_feature_names()
[u'disk', u'format', u'hard', u'how', u'my', u'proems', u'to'J

,
:
>>> print (X.toarray().transpose ())
[ [ 1 11
[ 1 11

[ 1 1]
[1
[1
[
[1

]
]
1]
] J

, ,
pI'oles!>, - , <,how!>, ),) <<tO!>.
, ,
. ,
.
, ,

. -
.
, .1
.

- ...

1111 11111

,
.

01.txt

This is toy post about machine learning. Actually, it contains not


much interesting stuff.

02.txt

lmaging databases get huge.

03.txt

Most imaging databases save images permanently.

04.txt

lmaging databases store images.

05.txt

lmaging databases store images. lmaging databases store images.


lmaging databases store images.

,
<<iagig databases.
, DIR,
CountVectorizer:
>>> posts = [open(os.path.join(DIR, f)) .read() for f in
os.listdir(DIR)]
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vectorizer = CountVectorizer(min_df=l)

,
, :
>>> X_train = vectorizer.fit_transform(posts)
>>> num_samples, num_features = X_train.shape
>>> print(#samples: %d, #features: %d % (num_samples,
num_features))
#samples: 5, #features: 25

5 11 25 .
:
>>> print(vectorizer.get feature_names())
[u'about', u'actually', u'capabilities', u'contains', u'data',
u'databases', u'images', u'imaging', u'interesting', u'is', u'it',
u'learning', u'machine', u'most', u'much', u'not', u'permanently',
u'post', u 1 provide', u'save', u'storage', u'store', u'stuff',
u'this', u'toy']

:
>>> new_post = imaging databases
>>> new_post_vec = vectorizer.transform([new_post])

........ 3. - ...

, transform
, ,
(
).
coo_matrix ( cOOrdinate,> ). ,
:
>>> print(new_post_vec)
(, 7) 1
(,

5)


ndarray:

toarray <),

>>> print(new_post_vec.toarray())

[[ 1 1 ]]

,
. (
),
:
>>> import scipy as sp
>>> def dist_raw(vl, v2):
delta = vl-v2
return sp.linalg.norm(delta.toarray())

norm (
). 11,1 ,
. Distance Coefficients between
T\vo Lists or Setsi,, The Python Papers S Codes,
(Maurice Ling) 35 .
ctist_raw, ,
:
>>>
>>>
>>>
>>>
>>>

import sys
best doc = None
best_dist = sys.maxint
best i = None
for i, post in enumerate(num_samples):
if post == new_post:
continue
post_vec = X_train.getrow(i)
d = dist_raw(post_vec, new_post_vec)
print("=== Post %i with dist=%.2f: %s"%(i, d, post))
if d<best dist:
best dist = d
best i = i

- ...

1111 llfill

>>> print("Best post is %i with dist=%.2f"%(best_i, best_dist))


=== Post with dist=4.00: This is toy post about machine learning.
Actually, it contains not much interesting stuff.

== = Post 1 with dist = l.73: Imaging databases provide storage


capailities.
=== Post 2 with dist=2.00: Most imaging databases save images
permanently.
Post with dist=l.41: Imaging databases store data.
Post 4 with dist = 5.10: Imaging databases store data. Imaging
databases store data. Imaging databases store data.
Best post is with dist=l.41

,
. r .
, . ,
1 , 110 , . .
,
3.
3 4 .
4 - 3, .
ero ,
3.
, :
>>>
[ [
>>>
[ [

print(X_train.getrow().toarray())
1 1 1
print(X_train.getrow(4).toarray())
3

]]

]]

, .
.

dist_raw
, :
>>> def dist_norm(vl, v2):
vl_normalized = vl/sp.linalg.norm(vl.toarray())
v2_normalized = v2/sp.linalg.norm(v2.toarray())
delta = vl_normalized - v2_normalized
return sp.linalg.norm(delta.toarray())

llflll

3. - ...


:
=== Post with dist=l.41: This is toy post about machine learning.
Actually, it contains not much interesting stuff.
=== Post 1 with dist=0.86: Imaging databases provide storage
capalities.
=== Post 2 with dist=0.92: Most imaging databases save images
permanently.
Post 3 with dist=0.77: Imaging databases store data.
Post 4 with dist=0.77: Imaging databases store data. Imaging
databases store data. Imaging databases store data.
Best post is 3 with dist=0.77

. ii 3 4
. -, , ,
,
.

2.
, : most (),
<<save (). <<images,> () <<permanently (
). 1111 u .
most,>, ,
-.
<<images.
. ,
,
.
, count
vectorizer :
>>> vectorizer

CountVectorizer(min_df = l, stop_words ='english')

, - ,
. stop_worcts
english, 318 .
:\.1, get_stop_worcts():
>>> sorted(vectorizer.get_stop_words()) (0:20)
['', 'about,

above', 'across',

after', afterwards', 'again',

- ...

1111 lil

'against', 'all', 'almost', 'alone', 'along', 'already', 'also',


'although', 'always', 'am', 'among', 'amongst', 'amoungst']

:
[ u'actually', u'capailities', u contains', u'data', u'databases',
u'images', u'imaging', u'interesting', u'learning', u'machine',
u'permanently', u'post', u'provide', u'save', u'storage', u'store',
u'stuff', u'toy']

-
:
=== Post with dist=l. 41: This is toy post about machine learning.
Actually, it contains not much interesting stuff.
=== Post 1 with dist=0.86: Imaging databases provide storage
capailities.
=== Post 2 with dist=0.86: Most imaging databases save images
permanently.
Post 3 with dist=0.77: Imaging databases store data.
Post 4 with dist=0.77: Imaging databases store data. Imaging
databases store data. Imaging databases store data.
Best post is 3 with dist=0.77

1 2 .
,
. ,
.

.
. ,
2 imaging images. ,
, .
, r,
. Sciit .
Natura 1 Language Toolkit (NLTK!,
, countVectorizer.
NLTK
NLTK
http: ! /nltk.org/install. html.
, Python 3 -

........ . - 1 ..

, , , pip .
http://www.nltk.org/nltk-alpha/,
setup..

Python :
>>> import nltk
NLTK Jacob Perkins
Python 3 Text Processing with NLTK 3 Cookbook, Packt
Puishing. , http://textprocessing.com/demo/stem/.

NLTK . ,
.
SnowballStemmer.
>>> import nl tk.stem
>>> s = nltk.stem.SnowballStemmer('english')
>>> s.stem("graphics")
u'graphic'
>>> s.stem("imaging")
u'imag'
>>> s.stem("image")
u'imag'
>>> s.stem("imagination")
u'imagin'
>>> s.stem("imagine")
u'imagin'
,
.

:
>>> s.stem("buys")
u'buy'
>>> s.stem("buying")
u'buy'

, ':
>>> s.stem("bought")
u'bought'
bougl1t - 1110 111111 1111;11.1110 .11.1 IJuy (1111).
n, 11 111 1 . - .. nept'IJ.

- ...

1111 llfiJI


NLTK


countvectorizer.
, II
.
n .
, ,

. build_analyzer:
>>> import nltk .stem
>>> english_stemmer = nltk.stem.SnowballStemmer('english'))
>>> class StemmedCountVectorizer( CountVectorizer):
def build_analyzer(self):
analyzer = super(StemmedCountVectorizer, self).build_analyzer()
return lamda doc: (english_sterrmer.stem(w) for w in analyzer(doc))
>>> vectorizer = StennedCountVectorizer(min_df=l, stop_words='english')

;1
.
1. (
) n
.
2.
( ).
3. .
,
iages imaging,> .
I<:
[u'actual', u'', u'contain', u'data', u'databas', u'imag',
u'interest', u'learn', u'machin', u'perman', u'post', u'provid',
u'save', u'storag', u 1 store 1

ustuff', u'toy']

<,iages iagin,>
, ,
2,
iag:
== = Post with dist=l. 41: This is toy post about machine learning.
Actually , it contains not much interesting stuff.
=== Post 1 with dist=0.86: Imaging databases provide storage
capailities.
=== Post 2 with dist= 0.63: Most imaging databases save images

........ 3. - ...

permanently.
Post 3 with dist=0.77: Imaging databases store data.
Post 4 with dist=0.77: Imaging databases store data. Imaging
databases store data. Imaging databases store data.
Best post is 2 with dist=0.63

,
, II
, .
-
. , ,
. , ,
<<sttbject,> (),
? , , countVectorizer
, max_df. ,
. 9, , 90
11, . 89
? , max_df?
, , -
: , .
-
<,,> ,
. , ,

, .
<< (term freqt1ency - inverse document
freqt1ency, TF-IDF). TF , IDF -
,>. :
>>> import scipy as sp
>>> def tfidf(term, doc, corpus):
tf = doc.count(term) / len(doc)
num_docs_with_term = len([d for d in corpus if term in d]J
idf = sp.log(len(corpus) / num_docs_with_term)
return tf idf
, , 110
. 1111
.

- ...

1111 I

1
, , -,
:
>>> , , = [""], ["", "", ""], [ .. ", "", "]
>>> = [, , ]
>>> print(tfidf("a", , ))
.
>>> print(tfidf("a", , ))
.
> print(tfidf("a", , ))
91

>>> print(tfidf("b", , ))

0.270310072072

>>> print(tfidf("", ,

D))

>>> print(tfidf("", , ))
0.135155036036

>>> print(tfidf("c", : , ))

0.366204096223

, ,
. , ,
.
,
. SciKit ,
TfidfVectorizer, CountVectorizer.
, :
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> class StemmedTfidfVectorizer(TfidfVectorizer):
def build_analyzer(self):
analyzer = super(TfidfVectorizer, self) .build_analyzer()
return lambda doc: (
english_stemmer .stem(w) for w in analyzer(doc))
>>> vectorizer = StemmedTfidfVectorizer(min_df=l,
stop_words= 'english', decode_error = 'ignore')
.
TF-IDF .


:
1. II .
2. ,
.

........ 3. - ...

3. , ,
.
4. .
5. TF-IDF 111 1
.
.

1ii .


, .
.
, <,Car hits wall (
) <,Wall hits (
) .
. , <,1 will eat ice
1.1111,> ( ) 11 1 will ot eat ice r
( )
, .
, ,
(l\1). (
) 11 (r).
.
, <<database <,databas,>
,
.
, ,

.

, , , ,
.
, 11 .

: 111 .
111111
111, , 11 :J:111111,1 .

........

- . 11, ,
,
ii .
.

- .
,
. ,
, .

1111. .
sklearn.cluster SciKit
.
http: //scikit-learn.org/
dev/modules/clustering.html.

- -
.

-
. , num_clusters,
.
num_clusters
.

, .
-
. ,
.
.
. ,
.
II
, ,
.
,
.
.

........ 3. n - ...

1.0

N
1\1
ID

"'"'

,.

0.8

Q.

:i:

04

)(

a:i

.?

00

..

0.6

Q4

0.3

1.0

,
,

:
1

1.0

0.8

N
1\1

"'

0.6

:i:

0.4

a:i

02

00

02

04

(,

03

10

........

,

. :
10

0.8

"'

'<

0.2

.--------.
0.4
o.z

0.6

0.3

1.0

.
( SciKit
0.0001).

, .
, ,
,
.

,
, ,
.
,
, , ,
.

........ 3. 3- 33...


20newsgroup, 18 826 20
- , comp.sys.
mac. hardware sci. crypt,
, 1 talk.politics.guns soc.religion.
.
, ,
.
http: //people.csail.
mit.edu/jrennie/20Newsgroups. ,
MLCoinp http: //mlcomp.org/datasets/379 (
). Sciit
.
ZI- ctataset-379 20news-18828_WJQIG.zip, 379.
Sciit .
: test, train raw. train
60 % , ,
test - 40 %, .
L _DSs_,
mlcomp_rot
.
http: / /mlcomp. org
.
:
, . ,
,
,
.

sklearn.ctatasets
fetch_20newsgroups, :
import sklearn.datasets
all_data = sklearn.datasets.fetch_20newsgroups(subset='all')
print(len(all_data.filenames))
18846
>>> print(all_data.target_names)
[ alt.atheism', 'comp.graphics , comp.os.ms-windows.misc ,
'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware',
'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles',
rec.sport. baseball , rec.sport.hockey , sci. crypt ,
'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian',
>>>
>>>
>>>

1111 ....

'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc',


'talk.religion.misc']

:
>>> train_data = sklearn.datasets.fetch_20newsgroups(subset= 'train',
categories = groups)
>>> print(len(train_data.filenames))
11314
>>> test_data = sklearn.datasets.fetch_20newsgroups(subset='test')
>>> print(len(test_data.filenames))
7532

,
. categories:
>>> groups = [ 'comp.grap hics', 'comp.os.ms-windows.misc',
'comp.sys.ibm.pc.hardware , 'comp.sys.mac.hardware',
'comp.windows.x', 'sci.space']
>>> train_data = sklearn.datasets.fetch_20newsgroups(subset='train',
categories =groups)
>>> print(len(train_data.filenames))
3529
>>> test_data = sklearn.datasets.fetch_20newsgroups(subset='test',
categories = groups)
>>> print(len(test_data.filenames))
2349

, , ,
. - .
,
UnicodeDecodeError.

,
:
>>> vectorizer

StemmedTfidfVectorizer(min_df = lO, max_df = O. 5,


stop_words = 'englis h', decode_error= 'ignore')
>>> vectorized
vectorizer.fit transform(train_data.data)
>>> num_samples, num_features = vectorized.s hape
> print(isamples: %d, Heatures: %d % (num_samples, num_features))
#samples: 3529, #features: 4712

3529
4172- .
. 50 ,
, .

........ 3. - ...
>>> num clusters = 50
>>> from sklearn.cluster import KMeans
>>> km = KMeans(n_clusters=num_clusters, init= 'random', n_init=l,
verbose=l, random_state= 3)
>>> km.fit(vectorized)

II . , ranctom_state,
,.
.
km.
, ,
km.labels_:
>>> print{km.labels_)
(48 23 31 ..., 6 2 22]
>>> print(km.labels_.shape)
3529

km.cluster_
,
km.predict.
centers_.


,
new_post:

Disk drive pmlems. i, / have 1"! reJith had disk.


After 1 1" it is working only spomdically nore,.
1 tied to f011nat it, but norC' it doesn 't boot more.
ideas? Thanks.
, 11 ,
11:
>>> new_post_vec = vectorizer.transform([new_post])
>>> new_post_label = km.predict{new_post_vec) []

, ,
new_post_vec .
.
:
>>> similar_indices

{km.labels_ ==new_post_label) .nonzero{) (0)

........

nonzero uu 1111ii ,
, rue.
similar_indices 11
:

> similar = []
>>> for i in similar indices:
dist = sp.linalg.norm((new_post_vec - vectorized[i]) .toarray())
similar . append((dist, d ataset.d ata[i]))
>>> similar = s orted(similar)
>>> print(len(similar))
131

, . 131
. :u u ,
, (show_
at_l) , - -
.
similar[OJ
>>> show_at_l
>>> show_at_2 = similar[int(len(similar)/10)]
>>> show_at_ = similar[int(len(similar)/2)]

1111
.

1.038

PROBLEM with IDE controller


Hi,
l've got Multi 1/0 card (IDE controller + serial/parallel
interface) and two floppy drives (5 1/4, 3 1/2) and Quantum
ProDrive connected to it. 1 was to format the hard
disk, but I could not boot from it. 1 can boot from drive :
(which disk drive does not matter) but if I remove the disk
from drive and press the reset switch, the LED of drive :
continues to glow, and the hard disk is not accessed at all.
I guess this must proem of either the Multi 1/ card
or floppy disk drive settings (jumper configuration?) Does
someone have any hint what could the reason for it. [... ]

1.150

Booting from drive


I have 5 1/4" drive as drive . How can I make the system
boot from my 3 1/2" drive? (Optimally, the computer would
to boot: from either or , checking them in order
for t disk. But: if I have to switch s around and
simply switch the drives so that: it can't boot 5 1/4" disks,
that's . Also, boot_b won't do the trick for me. [... ]

[ ... ]

........ 3. - ...
'

1.280

IBM PS/1 vs FD

Hello, 1 already tried our national news group without


success. 1 tried to replace friend s original IBM floppy disk
in his PS/1-PC with normal drive. 1 already identified
the power supply on pins 3 (5V) and 6 ( 12V), shorted pin 6
(5.25""/3.5'" switch) and inserted pullup resistors (22) on
pins 8, 26, 28, 30, and 34. The computer doesn't complain
about missing FD, but the FD s light stays on all the time.
The drive spins up o.k. when I insert disk, but I can't access
it. The works fine in normal . there any points
I missed? [ ... ]

[ ... ]

, .
,
. (booting),
, . ,
. ,
, .

, ,
(, comp.
graphics) . ,
, .
.
>>> post_group = zip(train_data.data, train_data.target)
>>> all = [(len(post[O]), post[O], train_data.target_names[post[l]])
for post in post_group]
>>> graphics = sorted([post for post in all if
post[2J=='comp.graphics'])
>>> print(graphics[5])
(24 5, 'From: SITUNAYA@IBM3090. ..UK\nSubject:
test....(sorry)\nOrganization: The University of Birmingham, United
Kingdom\nLines: 1 \nNNTP-Posting-Host: ibm3090.bham. .uk< ...snip...>',
'comp.graphics')

r comp.
r , 11111
,1 :

graphics,

>>> noise_post = graphics[5] [1]


>>> analyzer = vectorizer.build analyzer()

11111 IIJI

>>> print(list(analyzer(noise_post)))
['situnaya', 'ibm3090', 'bham', '', 'uk', 'subject', 'test',
'sorri 1 , 'organ', univers', 'irmingham', 'unit', 'kingdom', 'line',
'nntp', 'post', 'host, 'ibm3090', 'bham', '', 'uk']

, ,
-.
, -
fit_transform - min_df
max_df, :
>>> useful = set(analyzer(noise_post)) .intersection
(vectorizer.get_feature_names())
>>> print(sorted(useful))
['', 'irmingham', 'host', 'kingdom', 'nntp', 'sorri', 'test',
'uk', 'unit', 'univers']

.
, IDF. ,
TF-IDF,
. IDF :
, , .
>>> for term in sorted(useful):
print('IDF(%s)=%.2f'%(term,
vectorizer._tfidf.idf_[vectorizer.vocabulary_[term]J))
IDF(ac)=.51
IDF(birmingham)=.77
IDF(host)=l.74
IDF(kingdom)=.68
IDF(nntp)=l.77
IDF(sorri)=4.14
IDF(test)=.83
IDF(uk)=.70
IDF(unit)=4.42
IDF(univers)=l.91

, ,
kingdom, ,
, IDF. ,

.
, , , -
, .
, ,
.
irmingham

........ . 11 - 11...

? JI
, ?
! ,
, max_features
( !). ,
. , ,
. ,
.1 :
, , .
.
, ,
<<,>. SciKit ,
. sklearn.metrics
. ,
- .

,
,
l\1,
,
. u
yc1111ii. -
1,
.
, ,
SciKit .
.
, .
.

r4.

n

.
, . ,
.
11 Python. - 1,
<<Pytho, << ,>?
1111 -
. -11 11
. , , , <
, , , .
1111 ,
, :-.1
1111ii .
. r.
, , 1111, 1,
.
' II
. ii
, 1 {1 0110,
. , ii
,
.

11,
LDA: !l (latent Dirichlet
allocation), ,
1111 (linea disci111i11ant analysis) - -

........

4.

. ,
, . scikit-learn
sklearn. lda,
.
scikit-lea .
,
, ,
.
http:;;
en .wikipedia .org/wki/Latent_Dirichlet_allocation.

LDA 11
1- . LDA
, ,
, . ,
- ,
. LDA ,
.
. , <,Pythoni; varia\ei;
() , ineb1iatedi (
) - . ,
,
, .
, , :
;
Python;
.
n. ,
, ,
, 50/50. 1111
, 70/30. 1- ,
n . , 11
; , 11 .
,
, .
.
, 11i1 .
, 110
. 1111 ,
:\11, ,
t.

........

. -
1 ,
,
.

, scikit-lear
. l\1 11 Python
gensi. -

.
pip install gensim


Associated Press ().
,
111-111. ,
:
>>> from gensim import corpora, models
>>> corpus = corpora.leiCorpus('./data/ap/ap.dat', './data/ap/voca.txt')

corpus
, .
:
>>> model = models.ldamodel.LdaModel(
corpus,
num_topics = lOO,
id2word = corpus.id2word)


,
.
. model[docJ
, :
>>> doc = corpus.docbyoffset(O)
>>> topics = model[doc]
>>> print(topics)
[ (3, .023607255776894751),
(13, .11679936618551275),
(19, .075935855202707139),

........
(92,

.4.

.10781541687001292)]

!
,

.
,
. ,
, , ,
. ,
.
(topic_inctex, topic_weight).
,
( , 1 2 , ).
, ,
. ,
, LDAy
, ,
ii , .
,
:
>>> num_topics_used = [len(model[doc]) for doc in corpus)
>>> plt.hist(num_topics_used)

:
250

200

i,! 150
:r

:s:
':S"

100

50

10

15

20

25

30

15

40

45

,
( ,
).
.

, , ,
. , -
r,
,
.

, 150
5 , 10 12 .
, 20
.
alpha.
'I ,
alpha, .
alpha ,
1. alpha,
. gensi alpha
1/num_topics, ,
Lctaoctel :
>>> model = models.ldamodel.LdaModel(
corpus,
num_topics= lOO,
id2word=corpus.id2word,
alpha= l)

alpha ,
.
, gensi
-
.
,1 20
25. ,
( ,
).
?
,
. ,
, iI.

........

4.

250

alpha
200

..

alpha= 1.0

150
%
IV
:1:
>,

8: 100
-::

50

10

15

20

25

30

35

40

45


, 1 .
.
:

4
5
6
7
8
9
1

dress military soviet president new state capt carlucci states leader
stance government
koch zambla lusaka onepany orange kochs i government
mayor new political
human turkey rights abuses royal thompson threats new state wrote
garden president
ill employees experiments levin taxation federal measure
legislation senate president whistleowers sposor
ohio july drought jesus disaster percent hartford mississippi crops
orther valley virginia
united percet illion year president world years states people i
bush news
hughes affidavit states united ounces squarefoot delaying
charged urealistic bush
yeutter dukakis bush covention farm subsidies uruguay percent
secretary general i told
kashmir goverment people srinagar india dumps city two
jammukashmir group moslem pakistan
workers vietamese irish wage immigrants percent bargaining last
island police hutto I

........

, , , n
, , - . , . ,
- ,
,
. ,
.
, , ,
:

, (
, <<1 ), -
-.
-, 1 ,
. , ,
- r
.
.

.
.
pytagcloud.

,
. ,
.

DD1111

4.


, . 1

. :-.1,
,
. ,
.
.
11-1
. ,
11 11,
. ,
, ,
.
,
, 11 11 , ,
! ,
(,
<, ,>, - << ,> ).

.
.

-
,
, . 1111
,
.
.
, , 1111 .
-
,
. , 11
; ,
, ,
('.. :-.1 1,
11:1, . ,

,
, ,
( ).
11, gensi 11
. ,
Nt1Py :
>>> from gensim import matutils
>>> topics = matutils.corpus2dense(model[corpus],
num_terms = model.num_topics)

topics : .
ii pctist
SciPy. .1
sum((topics[ti] - topics[tj])**2):
>>> from scipy.spatial import distance
>>> pairwise = distance.squareform(distance.pdist(topics))

;
distance (
):
>>> largest = pairwise.max()
>>> for ti in range(len(topics)):
pairwise[ti,ti] = largest+l

!
(
):
>>> def closest_to(doc_id):
return pairwise[doc_id].argmin()
, ,
:
,
( ,

, ,
).

- (
):
From: geb@cs.pitt.edu (Gordon Banks)
Subject: Re: request for information on "essential tremor" and

1m

4.

Indrol?

In article <lqltbnINNnfn@life.ai.mit.edu> sundar@ai.mit.edu


writes:
Essential tremor is progressive hereditary tremor that gets
worse
when the patient tries to use the effected memer. All lims,
vocal
cords, and head can involved. Inderal is beta-ocker and
is usually effective in diminishing the tremor. Alcohol and
mysoline
are also effective, but alcohol is too toxic to use as
treatment.

Gordon Banks NJXP 1 "Skepticism is the chastity of the


intellect, and
geb@cadre.dsl.pitt.edu I it is shameful to surrender it too
soon. 1'

- closest_
:

to < 1 > -

From: geb@cs.pitt.edu (Gordon Banks)


Subject: Re: High Prolactin
In article <93088.112203JER4@psuvm.psu.edu> JER4@psuvm.psu.edu
(John . Rodway) writes:
>Any comments on the use of the drug Parlodel for high prolactin
in the d?
>
It can suppress secretion of prolactin. Is useful in cases of
galactorrhea.
Some adenomas of the pituitary secret too much.

Gordon Banks NJXP 1 "Skepticism is the chastity of the


intellect, and
geb@cadre.dsl.pitt.edu I it is shameful to surrender it too
soon. "

,
.

1111DD

LDA ,
,
'I
. gensim,
.
, !
, ,
.
http: //dumps.
wikimedia.org. ( 10 ),
, ,
.
:
python -m gensim.scripts.make_wiki \
enwiki-latest-pages-articles.xml.bz2 wiki_en_output

,
Pytho. ,
, .
. ,
.
:
>>> import logging, gensim

d,
Python ( gensim
). , -
, :
> logging.basicConfig(
format='%(asctime)s : %(levelname)s : %(message)s',
level=logging.INFO)

:
>>> id2word = gensim.corpora.Dictionary.load_from_text(
'wiki_en_output_wordids.txt')
>>> mm = gensim.corpora.MmCorpus('wiki_en_output_tfidf.mm')

, L-, :
>>> model = gensim.models.ldamodel.LdaModel(
corpus = mm,
id2word=id2word,

m1

4.

num_topics lOO,
update_every = l,
chunksize = lOOOO,
passes = l)
=

.
, , .
,
:
>>> model.save('wiki_lda.pkl')

,
( , ,
):
>>> model

gensim.models.ldamodel.LdaModel.load('wiki_lda.pkl')

moctel
topics, .
, ,
, :-1 (
4 ):
>>> lens = (topics > ) .sum(axis = O)
>>> print(np.mean(lens))
6.41
>>> print(np.mean(lens < = 10))
0.941

, 6.4 94%
1 .
,
. , (
), ,
.
:
>>> weights = topics.sum(axis = O)
>>> words = model.show_topic(weights.argmax(), 64)

, ,
, , ,
. 18%
( 5,5%
1111 ). :

11111

.
,
. , ,
.

:
>>> words = model.show_topic(weights.argmin(), 64)

4.

, , ,
.
1,6 % , 11 , 1 % .


n - 100.
,
u
10 II 200. , IHOIIX 11111 1 II
. :
,
,
11.
, ,
100 200, ;
100 - ( 20
1< 1J ).
alpha. ,
, .

. ,
.
alpha
.

,
, 1<0 ,
,.
,
.

. 1111 11
, ,
, LDA,
- . 11
, 011 ,
1 . :-.1,
111\ .

, ,
. , ii , ,
, , ,
.
, .
u,
<.,>.
: <.. <<,> . . 1
, ,
. 1.
, <<J\111.
.
11 ,
rl', Python
Ruby.
gensi
(HDP). .
LDA gensim.models.
ldamodel. LdaModel HdpModel:
>>> hdp = gensim.models.hdpmodel.HdpModel(mm, id2word)

(, l'.111 -
). 1
, LD-,
.

.
, , ,
.
, gensi.

, 11 ,
,
.
.
, , ,

. LDA

4.

2003 , , gensi
, 2010 ,
HDP - 2011. ,

:-.i . (
u1U ,w, ) 111u u
( - ,
).

: . , .
,
.

f5.
ll - n
nnoxx
11 11 ,
.
1 3 ,
, 11 .
-
:
. , Stack0vert1o\V,
,
.

.
11111ii
1 u 1111 (
1-111). rr
.
, , ,
, , ?
, 1 u
.
( ,
) .
.



, .
, ,

111!1 1111

5. -


( caiie StackOverflo,v).
, , 11, ,
.
. 1111 ,
, ,
r 11
, , 110
. , II
ii .



-
, .
:
1 ?
?

- , -
,
. -
.
. - ,,
, ,,
.

(, ),
, . ,
II .
: 11111 , ,
(SVM), i'I .
, i'I , ii : - 11ii
, i'I .

1 , caiia StackOverflow
, StackExchange,
II Stack0vert1ow,
cc-wiki. r-
https://archive .org/details/
stackexchange. -
StackExchange. StackOverf\o\\'
, 11\IX ir stackoverflow. com-Posts. 7z
1 5,2 .
r.1 26
XML, u row,
posts:
<?xml version='l.0' encoding='utf-8'?>
<posts>
<row Id="4572748" Postypeid="2" Parentid="4568987"
CreationDate="2011-01-0lTOO:Ol:03.387" Score="4" ViewCount=""
Body="&lt;p&gt;IANAL, but &lt;a
href=&quot;http://support.apple.com/lt/HT293l&quot;
rel=&quot;nofollow&quot;&gt;this&lt;/a&gt; indicates to me that you
cannot use the loops in your
application:&lt;/p&gt;&#xA;&#xA;&lt;ockquote&gt;&#xA;
&lt;p&gt;...however, individual audio loops may&#xA; not
co111111ercially or otherwise&#xA; distributed on standalone basis,
nor&#xA; may they repackaged in whole or in&#xA; part as audio
samples, sound effects&#xA; or music beds.&quot;&lt;/p&gt;&#xA;
&#; &lt;p&gt;So don't worry, you can make&#xA; commercial music
with GarageBand, you&#xA; just can' t distribute the loops as&#xA;
loops.&lt;/p&gt;&#xA;&lt;/lockquote&gt;&#xA;" OwnerUserid="203568"
LastActivityDate="2011-0l-OlTOO:Ol:03.387" Co111111E1ntCount="l" />
</posts>

Id

Integer

PostTypeid

Integer

.
:
;
.
.

5. -

Parentid

Integer

, ( )

CreationDate

DateTime

Score

Integer

ViewCount

Integer Empty -

Body

String

HTML

OwnerUserid

Id

. 1,

Title

String

(
)

AcceptedAnswerid

Id


( )

CommentCount

Integer

,1
L-11. , ,,
J, J ,
. row 2012 ,
6 (2 323 184 II
4 055 999 J),
. , , XML,
. , 1.
L-
Python cElementTree , }{
.

m, 111, ,
, , , 1

1111

.
. ,
.
, PostTypeict , .
,
.
CreationDate
,
. score, , -
.
Viewcount , , .

,
. !
, t .
L- ,
.
owneruserid ,
, ,
. ,

( ,
stackoverflow. com-Users. 7z).
Title ,


.
Commentcount. Viewcount,
,
( = ?).
.
AcceptedAnswerid Score ,
.
,
- IsAccepted, 1
( Parentid=-1).
:
Id <> Parentid <> IsAccepted <> TimeToAnswer <> Score
<> Text

so_xml _to_tsv.
,

choose_instance. .

........

n 5. n - n nnoxx

. i'1 meta. j son


JSON,
, t. ,
meta[ Id]['Score' J. data.tsv
rct t,
:
def fetch_posts():
for line in open("data.tsv", "r"):
post_id, text = line.split("\t")
yield int(post_id), text.strip()

?

, .
, .
, ,
Iscceptect. .
. ,
, ii
.
, .
,
. ,
.
- II

. ,
, , ,
?
,
?
- . ,
, , ,
, - ,
:
>>> all_answers = [q for q,v in meta.items() if v['Parentid'] !=-1)
>>> = np.asarray([meta[answerid] ['Score']>O for answerid in
all_answers])

111111&1

ii ,
. ,
, : 011
, .
,
.

k
,
sklearn. sklearn.
u
neighbors. 1 1 111l\1
:
>>> from sklearn import neighbors
>>> knn = ne1ghbors.KNe1ghborsClass1fier(n ne1ghbors =2)
>>> print(knn)
KNeighborsClassifier(algorithm= 'auto', leaf _size= O,
metric = 'minkowski', n_neighbors = 2, =2, weights = 'uniform')

,
sklearn: fit( 1,

predict():
> knn.fit([[l],[2],[],[4],[5], [6]], [0,0,0,1,1,1])
>>> knn.predict(l.5)
array([OJ)
>>> knn.predict(37)
array([l])
>>> knn.predict()
array([O])

predict_
, 1,
:
r ( 1.

>>> knn.predict_proba(l.5)
array([[ 1., .]])
>>> knn.predict_proba(37)
array([[ ., 1.]J)
>>> knn.predict_proba(.5)
array([[ 0.5, 0.5]])

111111

5. -

11 11 ?

?
TimeToAnswer meta,
, .
t,
, .
( !)
.

r .
, r,
, . ,
, :
import re
code match

re.compile('<pre>(.*?)</pre>',
re.MULTILINE I re.DOTALL)
link match re.compile('<a href="http://.*?".*?>(.*?)</a>',
re.MULTILINE I re.DOTALL)
tag_match re.compile('<[ >]*>',
re.MULTILINE
re.DOTALL)
def extract_features_from_body(s):
link- count- in- code =

for match_str in code_match.findall(s):


link_count_in_code += len (link_match.findall (match_str))
return len(link_match.findall(s)) - link_count_in_code

L-
. ,
BeautifulSoup,
, L-.

.
, ,
.
.
nii
. :

1111111&1

LinkCount

0.8
0.7

"'
"'

..

0.6
0.5

0.4

& 03
:r
Q)

0.2
0.1

10

15

20

25

35

40

, ,
, ,
. , - , .

,
kNN (k )
:
= np.asarray{[extract_features_from_body(text) for post_id, text in
fetch_posts() if post_id in all_answers])
knn = neighbors.KNeighborsCl assifier()
knn.fit(X, )

,
SNN ( 5 ).
SNN?
, k.
, , k.

, .
, - -

1111

5. -

. (
) 11 1 ( ).
kn n .score <).
, , ,
1 ,
KFold sklearn .cross_v alidation.
, ,
, .
from sklearn.cross_validation import KFold
scores = [)
cv

KFold(n = len(X), k= lO, in dices= True)

for train , test in cv :


X_train , y_train = X[train ), Y[train )
X_test, y_test = X[test), Y[test)
clf = neighbors.KNeighborsClassifier()
clf.fit(X, )
scores.append(clf.score(X_test, y_test))
prin t("Mean(scores)= %.5f\tStddev(scores)=\.5f"\
%(n p.mean(scores), n p.std(scores)))
:
Mean(scores)= 0.50250 Stddev(scores)=0.055591
. 55%
. , -
.
- , kNN k=5.


. ,
, , .
<pre> </pre>.
.
def extract_features_from_body(s):
num code lin es =
lin k- coun t- in - code =
code free s = s
#

11111

for match_str in code_match.findall(s):


num_code_lines += match_str.count('\n')
code_free_s = code_match.su"", code_free_s)
# ,
#
link_count_in_code += len(link_match.findall(match_str))
links = link_match.findall(s)
link_count = len(links)
link count -= link count in code
re.sub('' +", '' '',
html free s
tag_match.sub(' , code_free_s)) .replace("\n", "")
html free s
link free s

- -

#
for link in links:
if link.lower().startswith("http://"):
link_free_s = link_free_s.replace(link, '')
num text tokens

html free_s.count(" ")

return num text tokens, num_code_lines, link_count

, , ,
:
NumCodeLines

0.07

00i6

0.06

=
1:::

0.014
111
0. 012
..:
: 0,010
:i:

111 0.05

0.04

..

S. 0.006
: 0.004

0.01
0.00

NumTextokens

0.018

0.002

1 00

200

3)0

400

soo

600

700

10 0

200

300

400

500

600

700

800


:
Mean(scores)=0.59800 Stddev(scores)=0.02600

, , 4 10 .
. -

5. -

, , , , .
.
AvgsentLen: . ,
11 ,
?
AvgWorctLen: ,
.
NumAllCaps: ,
; .
NumExclams: .

, , ,
.
AvgSentLen

0.035
0.030

05

: 0.025

: 0.4

"'

0.020

::i 0.3

::i
0.015

i . .?

0010

lo1

0005
.

AvgWordlen

&

50

101)

150

tt

2SO

NumAIICaps

0.6

00

10

12

NumExclams

1.0

10

J5

20

25

;iQ

3)

40

00

1, .
, :
Mean(scores)=0.61400 Stddev(scores)= 0.02154

1111 IID

. II
. ?
, , ,,
kNN. SNN
, : Linkcount, NumTextTokens,
NumCodeLines, AvgSentLen, AvgWordLen, NumAllCaps NumExclams,
.
,
.
( ,
=2, - :1
r). 1, .
, kNN , NumTextTokens ,
, NumLinks. ,
,
:

NumLlnks

NumTextTokens

20

25
23

, 1 ,
, , .
, kNN
.

?
.
. ,
?
. ,
?
? k,
, ,
.
k .
. ,
? , , -

11&1

5. -

, - .
, , ,
01111, , .
. , kNN
,
,
?
, ii ,
, -
n
. , ,
, ,
1111. ,
11 -.

1 ,
ct, . ,
,
. ,
. , .
. .
,
, 100- . ,
(
). , .
, , G 1111 ,
, 1. ,

. .
,
.
.
, . ,
.

, , .
11 1.

11111

,
.
: ,
.

, , ,
.
, .
, k, ,
.

, 1:1 ,

.


,
.
.
SNN,

, :
Bias-Variance for 'SNN'

1/j 0.6

'8.

5 0.4
i:::

02

500

1000

1500

200(

11111

5. -

, ,
, ,
, ,
0,4. - -
k, .
.
, ,
LinkCount NumTextTokens:
Bias-Variance for 'SNN'

1.0

" 0.
:,:

8.

0.4

.., .

0.2

soc

1000

1500

2000


. ,
.
k
- :
k

mean(scores)

stddev(scores)

40

0.62800

0.03750

10

0.62000

0.04111

0.61400

0.02154


. , ,
k=40, .

11&1

r11 r11

40
:
Errors for for different values of k

nrw

0.8

"
:r
3
0.4

0.2

20

80

40

1,)0

,
. .
.
,
. ,
.
, , .
,
- .
, , - -,
. ,
,
.

, -
. ,
;
- .

I......

5. -

, r ,
, r
r 11 , 1. ,
,
1 6.
, r,
1, - (). , ,
1, () > 0.5, .
1.2

10
0.8

u
u

0.6
0.4

o.z
nn
-0.Z

-5

lC

lS

, ,
, ,
1. ,
11 1.
ero .
, r, 1,
0.9 - ( = 1) = 0.9. 11
( = 1)/( = ) = 0.9/0.1 = 9. ,
1 9:1. ( = .5), 1
1:1. 11 11 ,
( ). ero ,

1 (
). , -
, , -
1.

111111&1


10

"

..

"5

-6

0.2

0.4

0.6

0.8

1.0

-8
.

0.2

0.4

0.8

1.0

,
(, ,
, ) log(odds).
; = 0 + 1;

)=

0 + 1 ( log(odds) ).
l- p;
l
.
.,' ; =
-<
1 + O +,' )
,
(;, )
. scikit-learn.
,
1.
log[_.f!.L

>>> from sklearn.linear_model import LogisticRegression


>>> clf = LogisticRegression()
> print(clf)
LogisticRegression(C = l.O, class_weight= None, dual=False,
fit_intercept= True, intercept_ scaling = l, penalty=l2, tol= 0.0001)
> clf.fit(, )
>>> print(np.exp(clf.intercept_ ), np.exp(clf.coef_.ravel()))
[ 0.09437188) [ 1.80094112)
>>> def lr_model(clf, ):
return 1 / (1 + np.exp(-(clf.intercept_ + clf.coef_*X)))
> print("P(x = -1)=%.2f\tP(x =7)= %.2f"%(lr_model(clf, -1),
lr_model(clf, 7)))
( =-1)= 0.05 ( =7)= 0.85

, , scikit-learn
- - intercept_.

J......

5. -

, ,
.
1.2
1
0.8
0.6

"'

.
0.2
.
-0.2

-5

15

lU

lO

,
r .
?
(k=40)
, .

,,.

mean(scores)

stddev(scores)
'

0.64650

0.03139

LogReg =1.

0.64650

0.03155

LogReg =10.

0.64550

0.03102

LogReg =.01

0.63850

0.01950

40NN

0.62800

0.03750


LogReg =.1


. ,
k . ,
.
-
=. 1 , -

1111 I

II ,
. , r
,
.
Bias-Variance for 'LogReg =.10'

1.0

-
0.6

"

i::::

0.6

0.4

0.2

500

1000

1500

2000

? , ,
,
.
,

.

-

, . ,
,
.
,
,
. , ,

1m

5. -

, 11,
, : <s\ ,,.
,
,
, .
, ,
. , ,
,
.

(FN)

()


(FP)
(TN)

, ,
, ,
. ,
, ,
, .
,
, . ,
.
:

= ---
TP + FP
, ,
,
:

= ---
TP + FN
-
, -
1111.

111111111

()

(+ FN)

(+ FP)

?
, , 0.5.
, FP FN,
1. ,
.
precision_recall
curve() metrics:
>>> from sklearn.metrics import precision_recall curve
>>> precision, recall, thresholds = precision_recall_curve(y_test,
clf.predict(X_test))

,
.
,
( )
( ) .

111!1
1

5. -

/ (AUC = 0.64) / nnoxe w

,.

/ (AUC = 0.68) / w

.8

.,

.
.

0.6

1.(

(,,()
(}(J

'

tH

IJt,


- (AUC).
.
.

, (
). ,
60 .
,
80% 40%. ,
. ,
( KFold())
,
. :
>>>
>>>
>>>
>>>

medium = np.argsort (scores) [int(len(scores) / 2)]


thresholds = np.hstack(((OJ,thresholds[medium]))
idxBO = precisions>=0.8
print("P=%.2f R=%.2f thresh=%.2f" % (precision[idxBOJ(0),
recall[idxBOJ[], threshold[idxBOJ[]))
=.80 R=0.37 thresh=0.59

. 59,
80% 37%. ,
.
. ,
,
.

........

,
predict_proba(),
, 11ii predict(),
.
>>> thresh80 = threshold[idx80}[}
>>> probs_for_good = clf.predict_proba(answer_features)[:,1}
>>> answer_class = probs_for_good>thresh80

, DI classification_report:

>>> from sklearn.metrics import classification_report


>>> print(classification_report(y_test, clf.predict_proba [:,1}>.63,
target_names = ['not accepted', 'accepted'}))

not accepted
accepted
avg / total

precision
0.59
0.73
0.66

recall
0.85
0.40
0.63

fl-score
0.70
0.52
.61

support
101
99
200

, ,
,
.


ii
. r

:J (clf. coef_).
, r
. ,
,
.
,
LinkCount, AvgwordLen, NumAllCaps
NumExclams, Numlmages (, ii
) AvgsentLen
. rJ
, , Numimages ,

DD1111

5. -

. , ,
. , , .
11 , ,
. -
.
n LogReg

0.4

0.3 .. ..

0.2

1 '

= 0.10

. . \ .

..

0.1

-0.l

'1:

"'

"'
,,,

Q.

!,,,'
<{

:,

"

"

"'

"

lJ

.::;

-g
':,:'

'"

s
7

..

"'

-"

'!.'

1-

'1:

"'
""
V>

::,

v
-"

::1

!
, .
, ,
.

:
>>> import pickle
>>> pickle.dump(clf, open("logreg.dat", "w"))
>>> cl f = pickle.load(open( "logreg. da t", "r"))

11111

, ,
. !

!
,
. ,
,
.
r .
, LinkCount, NumTextTokens, NumCodeLines,
AvgSentLen, AvgWordLen, NumAllCaps, NumExclams Numimages,
.
, ,
.

.
r
, ,
: ii .

scikit-learn.

r6.
1111 11 - 11
11
,
,
-. ,
111, l\1 : ,
1 1 J\ .
, << ,>,
, 11
. ,
ii ,
u
1.

, 11111, 1
1111 - 140:.
u
111 11111, , 11
11. T111111111>1ii - 111ii
u
ii, 11 :-.111111 1 11111111
, . i'1 111m,ii
- .
,
. :

1111, ;
11, II
;
.,111, 1,1 11 ii
scikit-lea1, 1111 :1 111,11\111.


, II
, : ,
.
5000 1, ii
(Niek Sanders);
.
, i\l
, - ,
, ,
1
, install.
, .
,
5000 .
ceiiac.
ii
:
>>> , = load_sanders_data()
>>> classes = np.unique(Y)
>>> for in classes: print("l%s: %1" % (, sum(Y==c)))
#irrelevant: 4 90
#negative: 487
#neutral: 1952
#positi ve: 4 33

load_sanders_data() II
- - 1
, , 3362
. L1, , ,
11
.
.



- ,

. , ,

1 6. n ll - ...
. ,
.
. .
?
<1'i,> 1 1,
r .
, 1111 .
. , ,
,
1 .

\ - ,
, .
,
. :v
; 1 ,
:v . ,
.
, ,
. -
-
.
:

( )

F,


awesome ()

crazy ( )

,
, 111 f1 11
Fi . ( CI f1 , F).
:\ P(Clf1 , F) ,
, ii:
() P(BIA) = () ( IB)

11111

, - ,
awesoe>,') <<cr-azy>,>, - 1 ,
1 ,
1<:
P(FI ,F) P(CIFI , F) = () P(FI , F2 IC)
(CI F1 , F) r :
( ) P(FI ,F2 I )
P(F1 ,F)
:
postaio1 likelihood
,-i = -------
erJidence
p1io1 evidence :
( ) - -
. ,
, .
P(F1 , F2 ) - ,
F1 F2
- (likelihood)
P(F1 , F2 I ). , ,
F1 F2 , , .
.


:
P(FI ,F2 I ) = P(FI I ) . P(F2 I ,FI )
,u ,
( P(F1 , F2 1 )) , (
P(F) C,F1 )).
, F1 11 F2 ,
P(F2 1 , F1 ) ( F2 1 ), :
P(FI ,F2 I) = P(FI I ) . P(F2 I )

, 1 :

111!1

6. 11- ...

: 111.11 , i'1
,.

r

ii 1, , :
P(F1 I = "pos")
_ ( = "pos") P(F1 1 = "pos")
_
------(- . pos "I F1 ,F,)--------P(FI , F)

_ ( = "g") P(F1 1 = "g") P(F1 1 = "g")


_. ..
( - neg 1 F1 , F,)
- - ------------------P(FI , F)

,." .
P(F,, F) II ,
, - t1
.
, ,
. , ,
.
:
v
, , 11
. , :

1"'st = argax ( =) P(F1 IC =) P(F.IC =)


, g
( r pos II neg) ,
.

- ,
. 6 r
, : <<a\veso111e>
11 <<crazy> 11 1.111 '111 , 111:

't

111111111

awesome

awesome

awesome crazy

crazy

crazy

crazy

u <,crazy,> , II

(, 4 > 4
). -
, :
( = "'pos") = i_
6
( = "neg")

=
=

0.67
0.33

, , ,
, .
P(F, 1 ) P(F) ) -
F, F2 .
,
, ,
. <<awesomei.>,
, , :
"

1 , aroesome
; 1.

"

(FI = 11 = pos )=

3
4

u
4awesome,>.
, <<awesomei.>
:

P(F,

"

= pos") = 1 - P(F,

11 = "pos"') = 0.25

(
, ):

IIEI

6. 11 II - ...

P(F,- = 1J = "pos") = 3._ = 0.5


4
P(F1

1J = "neg") = =

P(F2 = 1J C = "g")

= 1
2

,
. F1 F2
LJ :
=
=
=
=
P(_F1 , F2) = P(_F1 , F
2JC "pos") (_ "pos") + P(_F1 , F)C "g') (_ "g'')

:
3

24

24

P(F1 = 1'2
F = 1) =---+ 1- =446
6 4
P(F1 = 1'2
F =) = ---+- = 446
6 4
124 222

P(F1 = 'F2 = 1) = ---+--- = 4 4 6


2 2 6
15
, ,
.
:

F, F2 ,

awesome 1

3 2 4
---

P(C="pos"I Fi =l,F2 =0)== 1


-

-2 2
P(C="neg"IFi =l,F2 =0)==0
4

F, F2

crazy ..

1 2 4

( ="pos" 1 F.1 = ' F,- = 1) = 4. 4. 6 = _!_


5
5
12

2 2 2

( ="neg"J F; = O,F2 =l) =--0--=


12
awesome
crazy

3 2 4
P(C="pos"IF; =l,F2 =l)==l

4
2 2
P(C="neg"JF1 =l,F2 =l)==O
4
. ,
, . ,
, .
.

,
. 111 ,
. ,
. . ,

- . ,
, <<text,>. .
, ,
.
(add-oc sinootl1ig).

11111

6. 11- ...


. ,

, .
1, alpha<O,
.

,
.
,
, , , 11
. ,
, ,
, .

P(F., = I J ="pos")= -= . 75
4
3+1

P(F.,=\JC="pos")=

4+2

=0.67.

:1 2?
: <<\\'s,> <,crazy,>.
1 , ,
- . ,
1:
3+1 \+\ I
-

P(F.,=\JC="pos")+P(F.,=OJC="pos")=--+- =
4+2 4+2

.
, 111 ,
. ,
, .
, , N :
>>> import numpy as np
>>> np.set_printoptions (precision =20) numpy
# ( 8)
>>> np.array([2.48E-324])
array([ 4.94065645841246544177-324])
>>> np.array([2.47E-324])
array([ .])

,
, ii ,
0.0001,
65 ( 65
). :
2. 47-324?

> = 0.00001
>>> **64 #
le-320
>>> **65 # !

Python
doue . , , :
>>> import sys
>>> sys.float_info
sys.float_info(max = l.7976931348623157e+308, max _exp= l024,
max_l0_exp = 308, min=2.2250738585072014e-308, min_exp= -1021,
min_lO_exp = -307, dig=15, mant_dig=S, epsilon= 2.220446049250313e-16,
radix = 2, rounds = l)

,
mpmath (http: // code. goo gle. com/p/mp math/),
.
, NuPy.
, ,
:
log(x ) = log(x)+ log(y)
:

log () P(F1 1 ) P(F2 1 ) = log ()+ log P(F1 1 )+ log P(F2 1 )

1,
- . .
, , ,
.: .
:
, . ,
,
, q,
. ,
, P(C = 'pos"IF1 , F) > P(C = "neg"IF 1 , F), ,
log ( = 'pos"IF1 , F) > log ( = "eg"IF 1 , F)

11:r:J
I

6. 11- ..

.
.
-

-1
-2

-
-4
-5
-6
-7
.

02

0.4

0.6

08

10

,
,

es1 = argmax ( =) P(F1 IC =) P(F2 IC =)


, , ,
,
:

sc = argmax (log ( = )+ log P(F1 1 = )+ log P(F2 I = )



, ,
:

., =argmax(logP(C=c)+ I,logP(Fk IC=c))


ceN


, scikit-lea1n.
,
r .
,

1111 I

.
;,
. , ,
. 1. ,
.

ii - -
sklearn.naive_bayes.
Gaus sianN: ,
() . -
.
, .
, .
ultinomialNB: ,
, .
II TF-IDF.
ernoulliN: MultinomialNB,
, 11
<,- ;,, .
, , 1 , 1 MultinomialNB.

, 1
i'!. ,
,
(,
: ... http://liki;, ).
, , . ,
.
>>>
>>>
>>>

, true,
#
pos _neg_idx = np.logical_or(Y=="positive", Y=="negative")

>>>
>>>

X[pos neg_1dx]

1111
>>> '{

6. 11- ...

Y[pos_neg idx]

>>># ,
>>> '{
Y = ="positive"

1, -
: 1
.
, ,
. , ,
TfidfVectorizer
TF-IDF,
.
Pipeline,
, :
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
def create_ngram_model():
tfidf_ngrams = TfidfVectorizer(ngram_range =(l, ),
analyzer = "word", binary = False)
clf = MultinomialNB()
return Pipeline([('vect', tfidf_ngrams), ('clf', clf)])

Pipeline, create_ngram_model (),


,
.
,
.
KFold,
, ShuffleSplit.
,
.
- .

train_model(),
.
from sklearn.mecrics import precision_recall_curve, auc
from sklearn .cross_validation import ShuffleSplit
def train_model(clf_factory, , ):
# random_state,
cv = ShuffleSplit(n = len(X), n_iter = lO, test_size = O.,

3
random_state = O)
scores = []
pr_scores = []
for train, test in cv:
X_train, y_train = X[train], Y[train]
X_test, y_test = X[test], Y[test]
clf = clf_factory()
clf.fit(X_train, y_train)
train_score = clf.score(X_train, y_train)
test_score = clf.score(X_test, y_test)
scores.append(test_score)
proba = clf.predict_proba(X_test)
precision, recall, pr_thresholds =
precision_recall_curve(y_test, proba[:,l])
pr_scores.append(auc(recall, precision))
(np.mean(scores), np.std(scores),
np.mean(pr_scores), np.std(pr_scores))
print("%.3f\t%.3f\t%.f\t%.f" % summary)

summary

, :
>>> , = load_sanders_data()
>>> pos_neg_idx = np.logical_or(Y=="positive", Y=="negative")
>>>
X[pos_neg_idx]
>>> = Y[pos_neg_idx]
>>> = Y=="positive"
>>> train_model(create_ngram_model, , )
0.882
0.024
0.036
0.788


r TF-IDF 78.8%
/ 88.2%. /
( ,
, ) ,
,
.
.
100 %
.
.

1 6. 11- ...
1.0

/ (AUC = 0.90) / pos neg_

0.8

0.6

04
0.2

0.2

0.4

0.6

08

1.0


,
. ,
, ,
1, , , ,
.
, ,
- ? ,
,
, ,
:
def tweak_labels(Y, pos_sent_list):
pos = Y==pos_sent_list[O]
for sent_label in pos_sent_list[l:]:
pos 1= Y==sent_label
= np.zeros(Y.shape[O])
Y[pos] = 1
= Y.astype(int)
return

, <<,
. ,

. , ,
,
, :
>>> = tweak_labels(Y,

["positive", "negative"])

1 ( )
, -i
, .
>>> train_model(create_ngram_model, , , plot=True)
0.659
0.023
0.012
0.750

.
1.0

/11 (AUC = 0.6) / sent rest

.
.

0.6

04

0.2

.
.

0.2

0.4

0.6

0.8

1.0

, /
66%. - ,
, .
3362 920, 27 ,
.
, , ,
, 73%.
, ,
, - ,
.

() ? .

6. 11- ...

== Pos vs. rest ==


0.873 0.009 0.305 0.026
== Neg vs. rest ==
0.861 0.006 0.497 0.026

, . /
-
-
.
/ (AUC = 0.31) / pos

1.0

0.8

"'

0.6

1- 0.4
0.2

.
.

1.0

0.2

0.4

0.6

1.0

0.8

/ (AUC = 0.50) / neg

08

"'

0.6

1-

0.4

0.2

0.2

0.4

0.

0.8

1.0

,
, . ,
<< >: TfidfVectorizer MultinomialNB.
, , ,
.
TfidfVectorizer:
ii -:
(1,1);
(1,2);
, II (1,3).
min_df: 1 2;
IDF TF-IDF
use_idf smooth_idf: False True;
- - stop_words
english None;
(suinear_tf);
,
- binary True
False.
ultinomialNB:
alpha,
:
1, 11 : 1;
: 0.01, 0.05, 0.1 0.5;
: .
:
, ,
. ,
,
. ,
.
1111
<1 , scikit-learn
: GridSearchCV. -
( , ),
Pipeline :-.1
.

6. 11- ...

Grictsearchcv ,
,
. :
<estimator>_<subestimator>_..._<param_name>
,
ngram_range TfidfVectorizer (
Pipeline vect), :
param_grid=( "vect_ngram_range"= [

(1,

1),

(1,

2),

(1,

3)]

Grictsearchcv ,
ngram_range vect.

11ii .
,
ShuffleSplit,
1111 .
,1 - best_estimator_.
:.1
, -
. Shufflesplit
cv ( cv Grictsearchcv).
, , - , Grictsearchcv
.
score score_func.
sklearn.metrics.
metric.accuracy - (
, ).
-
. ,
, F-,
metrics . fl_score:

F=

2
+

, :
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import fl_score
def grid_search_model(clf_factory, , ):


cv

= ShuffleSplit(
n= len(X), n_iter= lO, test_size= 0.3,random_state = O)

param_grid = dict(vect_ngram_range = [(l, 1), (1, 2), (1, 3)],


vect min_df = [l, 2),
vect_stop_words= [None, "english"],
vect smooth_idf = [False, True],
vect_use_idf = [False, True],
vect_suinear_tf= [False, True],
vect_binary = [False, True],
clf_alpha= [O, 0.01, 0.05, 0.1, 0.5, 1],
grid_search = GridSearchCV(clf_factory(),
param_grid= param_grid,
cv= cv,
score_func = fl_score,
verbose= lO)
grid_search.fit(X, )
return grid_search.best_estimator_

:
clf = grid_search_model(create_ngram_model, , )
print(clf)

3 2 2 2 2 2 2 6 = 1152 ,
.
... ...
Pipeline(clf = MultinomialNB(
alpha =O.01, class_weight= None, fit_prior= True),
clf_alpha = O.01,
clf class_weight= None,
clf_fit prior= True,
vect= TfidfVectorizer(
analyzer= word, binary = False,
charset= utf-8, charset_error= strict,
dtype = <type 'long'>,input = content,
lowercase = True, max_df = l.O,
max_features = None, max_n= None,
min_df= l, min_n = None, ngram_range= (l, 2),
norm = l2, preprocessor= None, smooth idf = False,
stop_words= None,strip_accents= None
suinear_tf = True,token_pattern= (?u)\b\w\w+\b,
token_processor= None, tokenizer= None,
use_idf =False, vocabulary = None),
vect_analyzer = word, vect_binary = False,
vect_charset = utf-8,

6. 11- ...

vectcharset_error=strict,
vectdtype= <type 'long'>,
vect input=content, vectlowercase=True,
vect max_df=l.O,vectmax_features=None,
vectmax_n=None, vectmin_df=l,
vectmin_n=None, vectngram_range= (l, 2),
vectnorm=l2, vectpreprocessor=None,
vectsmooth_idf= False, vectstop_words=None,
vectstrip_accents=None, vectsuinear_tf=True,
vecttoken_pattern= (?u)\b\w\w+\b,
vecttoken_processor=None, vecttokenizer=None,
vectuse_idf= False, vectvocabulary=None)
0.007
0.702
0.795
0.028


/ 3.3%, 70.2.
, , .

() ,
:
Pos vs. rest ==
0.889 0.010 0.509 0.041
== Neg vs. rest ==
0.886 0.007 0.615 0.035

==

.
1.0

/ (AUC = 0.52) / pos

0.8

...

0.6

1-

"

. . . . , .. . . ... .

0.4

0.2

.
.

0.2

0.4

0.6

0.8

1.0

11111 ..


1.0

/ (AUC = 0.61) / neg

.
06
0.4
0.2

.
.

0.2

0.4

0.6

1.0

, / (,
,
, AUC
).
. ...


. -
. 140 ,
,
.
. , ,
.
preprocessor () TfidfVectorizer.
, ,
.
, ,
:
emo_repl = (


"&lt;": " good

........

6. 11- ...

":d": " good ", t :D


":dd": " good ", 11 :DD
"8) 11: " good ",
.. : _) .. : .. good ,. ,
:) ": " good ,
";) ": " good ",
"(-:": " good ",
"(:": " good ",

# :
":/": " bad ,
":&gt; 11 :

11

sad

":') : " sad ",


":-( 11 : " bad '',
":(": '' bad ",
":S": " bad ",
":-S": 11 bad

# , :dd :d ( )
emo_repl_order = [k for (k len,k) in reversed(sorted([(len(k),k)
for k in
emo_repl.keys()]))]

11
(\
):
re_repl = {

r"\br\b": "a re",

r"\bu\b": "you",
r"\bhaha\b": "ha",
r"\bhahaha\b": "ha",
r"\bdon't\b": "do not",
r"\bdoesn't\b": "does not",
r"\bdidn't\b": "did not",
r"\bhasn't\b": "has not",
r"\bhaven't\b": "have not",
r"\bhadn't\b": "had not",
r"\bwon't\b": "will not",
r"\bwouldn't\b": "would not",
r '' \bcan 't\b '': ''can not'',

r"\bcannot\b": "can not",


def create_ngram_model(params = None):
def preprocessor(tweet):
tweet = tweet.lower()
for k in emo_repl_order:
tweet = tweet.replace(k, emo_repl[k])

for r, repl in re_repl.items():


weet = re.sub(r, repl, tweet)

11111

return tweet
tfidf_ngrams
#

TfidfVectorizer(preprocessor=preprocessor,
analyzer= "word")

, , .


, 70.7%:
== Pos vs. neg
0.029
0.885
0.024
0.808
== Pos/neg vs. irrelevant/neutral
0.010
0.793
0.685
0.024
== Pos vs. rest
0.890
0.041
0.011
0.517
== Neg vs. rest
0.886
0.033
0.006
0.624


,
, , .
, ,
, ,
II .1. ,
r ?
, ,
, . ., , ,
.

(POS).
,
, ,
- , .
, ,
, 1 <<book,> -
( <<Tl1is is good book,> - << ,>)

1 6. 11- ..
( 4Cold yot1 please book the tlight?,> - <, r
?,>).
, , , NLTK
. , ero
. nltk.pos_tag()
,
Penn Treebank
(http://www.ci s.up enn. edu/-tr eeb a nk).
,
II .
>>>

i mport nltk

nltk.pos tag(nltk.word_token1ze(Th is is good book.))


[('This', ''), ('is', 'VZ'), ('', ''), ('good', 'JJ'), ('book',
'NN'), ('. ', '. ')]

>>>

> nltk.pos_tag(nltk.word_tokenize("Could you please k the flight?"))


[('Could', ''), ('you', 'PRP'), ('pl ease', 'VB'), ('book', 'NN'),
( 'th e', '' ) , ( 'flight' , 'NN' ) , ( '?' , '. ' ) ]
r1 Penn Treebank (.
http://www . a nc.org/OANC/p enn. html):

or

CD
DT

2,second

the

there

there are

FW

kindergarten

IN

, of, like

JJ
JJR

cool

cooler

JJS

coolest

LS
MD

1)

could, will

book

NN

11111 DI

NNS

books

NNP

Sean

NNPS

Vikings

PDT

both the boys

POS
PRP
PRP$
RB

friend's

1, he, it

my, his

however, usually,
naturally, here, good

RBR
RBS
RP

u
VB
VBD
VBG

better

best

give

to

to go, to him

uhhuhhuhh

take

took

taking

VBN

taken

VBP

take

VBZ


R ,

takes

WDT

wh

which

WP

wh

who, what

WP$

R wh

whose

WRB

wh

where, when

,
pos _ tag (). , -

IJ

6. 11- ...

, NN, JJ.JI
111 v, - JJ, - RB.

SentiWordNet
,
, , :.1 ,
: SentiWordNet (http://sentiworctnet.isti.cnr.it).
, 13 ,

. ,
1 .
.

,,

PosSore NegScore SynsetTerms

,. 'j

00311354

0.25

0.125

studious#1

00311663

0.5

careless#1

03563710

, ,
;

careful

implant#1

00362128

kink#2
curve#5
curl#1

,
;

<,
> <,book () k (
). PosScore NegScore
i' , 1-PosScore-NegScore.
Synseterms , .
. .
111, ,
. ,
<<fantasize,> , , , 11 . :

'

- . "J

PosScore NegScore SynsetTerms

01636859

0.375

fantasize#2
fantasize#2

01637368

0.125

fantasy#1
fantasize#1
fantasize#1

;
,

, ,
,
. ,
, << .,.

, . <,fantasize,>
PosScore 0.1875, NegScore - 0.0625.
load_sent_word_net()
,
/, n/implant, -
.
import csv, collections
def load_sent_word_net():
# , ,
#
#
sent_scores = collections.defaultdict(list)
with open(os.path.join(DATA_DIR, SentiWordNet_.0.0_20130122.txt"),

"r")

as csvfile:

reader = csv.reader(csvfile, delimiter = '\t', quotechar='"')


for line i reader:
if line [] .startswith ("#"):
continue
if len(lie)==l:
continue
POS, ID, PosScore, NegScore, SynsetTerms, Gloss
if len(POS) = =O or len(ID)==O:
continue
for term in SynsetTerms.split(" "):
#

line

6. 11- ...

term = term.split("#") [)
term = term.replace(''- 1 ', '' '') .replace(''_", '' 1 1)
key = "%s/%s"%(POS, term.split("#") [])
sent_scores[key] .append((float(PosScore), float(NegScore)))
for key, value in sent scores.items():
sent_scores[key] = np.mean(value, axis = O)
return sent scores



. aseEstima
tor, :
get_feature_names(): , transform():
fit(document,

y=None):

, ,
self;
transform(documents): numpy.array(),
(len(documents), len(get_feature_names)).
, documents
,
get_feature_names().
.
sent word net = load sent word_net()
class LinguisticVectorizer(BaseEstimator):
def get feature_names(self):
return np.array(['sent_neut', 'sent_pos', 'sent neg',
nouns', 'adjectives', 'verbs', 1 adverbs',
'allcaps', 'exclamation', 'question', 'hashtag',
'mentioning'])
# , ,
# : fit(d) .transform(d)
def fit(self, documents, y=None):
return self
def _get_sentiments(self, d):
sent = tuple(d.split())
tagged = nltk.pos tag(sent)
pos_vals = []

neg_vals = []
nouns = .
.
adjectives
verbs = .
adverbs = .
for w,t in tagged:
, n = 0,0
sent_pos_type = None
if t.startswith("NN"):
sent_pos_type = "n"
nouns += 1
elif t.startswith("JJ"):
sent_pos_type = ""
adjectives += 1
elif t.startswith("VB"):
sent_pos_type = "v"
verbs += 1
elif t.startswith("RB"):
sent_pos_type = "r"
adverbs += 1
if sent_pos_type is not None:
sent_word = "%s/%s" % (sent_pos_type, w)
if sent_word in sent_word_net:
p,n = sent-word-net[sent-word]
pos_vals.append(p)
neg_vals.append(n)
1 = len(sent)
avg_pos_val = np.mean(pos_vals)
avg_neg_val = np.mean(neg_vals)
return (1-avg_pos_val-avg_neg_val, avg_pos_val, avg_neg_val,
nouns/1, adjectives/1, verbs/1, adverbs/1)
def transform(self, documents):
obj_val, pos_val, neg_val, nouns, adjectives, \
verbs, adverbs = np.array([self._get_sentiments(d) \
for d in documents]) .
allcaps = []
exclamation = []
question = []
hashtag = []
mentioning = []
for d in documents:
allcaps.append(np.sum([t.isupper() \

6. 11- ...

for t in d.split() if len(t)>2]))


exclamation.append(d.count("!"))
question.append(d.count("?"II
hashtag.append(d.count("#"))
mentioning.append(d.count("@"))
result = np.array( [obj_val, pos_val, neg_val, nouns, adjectives,
verbs, adverbs, allcaps, exclamation, question,
hashtag, mentioning]).
return result



.
TfidfVectorizer .
Featureunion scikit-learn.
, Pipeline,
, ,
FeatureUnion
.
def create_union_model(params= None):
def preprocessor(tweet):
tweet = tweet.lower()
for k in emo_repl_order:
tweet = tweet.replace(k, emo_repl[k])
for r, repl in re_repl.items():
tweet = re.sub(r, repl, tweet)
return tweet.replace("- 11 ,

" ") .replace("

11

")

tfidf_ngrams = TfidfVectorizer(preprocessor = preprocessor,


analyzer= "word")
ling_stats

LinguisticVectorizer()

all features

FeatureUnion([('ling', ling_stats),
( tfidf', tfidf_ngrams)])

clf = MultinomialNB()
pipeline = Pipeline([('all', all features), ('clf', clf)])
if params:
pipeline.set_params(**params)
return pipeline

1m1


ii /
1 0.4%.
== Pos vs. neg
0.810
0.023

0.890

0.025

== Pos/neg vs. irrelevant/neutral


0.791
0.022
0.691
0.007
== Pos vs. rest
0.011
0.890
0.529

0.035

== Neg vs. rest


0.883
0.007
0.617

0.033

t ime spent: 214.12578797340393


(111111 )
II 1, ,
, -
(pos/neg versus irrelevant/eutral), , ,
, .

, ! ,
, ,
.
,
,
. ,
11, 1 .
, , - ( ,
) . ,
,
sentiWorctNet.
r .

rllABA 7.
r
, ,
. - 11
(OLS). 200 , 110
- 11
. 11 ,
scikit-lear11.
1111. t.
.1, , ,
, . 1,1
1 . - , .1111
. Lasso. rpeenoii perpecc1111 11
. 01111 scikit-learn, 11 1,1 .

- 1
. 1
. 11- < 1111 II
11, : 111,
. - :}
; . ,
, .
, n11 1t1111 scikit-lca1,
, :
>>> from sklearn.datasets import load boston
>>> boston = load_boston()

boston 1, n; <11, boston.


data , boston. target - .

...

m1

,
, l\1
.
(
boston. DESCR boston. feature_names):
>>> from matplotlib import pyplot as plt
>>> plt.scatter(boston.data[:,5], boston.target, color = 'r')

oston.target (
).
, , , .
:
>>> from sklearn.linear model import LinearRegression
>>> lr = LinearRegression()

LinearRegression 113 sklearn.


.
,
scikit-learn.

linear_model

>>>
>>>
>>>
>
>>>

boston.data[:,5]

boston.target
np.transpose(np.atleast_2d(x))

lr.fit(x, )
y_predicted = lr.predict(x)

-
np.atleast_2d,
. , fit
, . ,
,
.
, fit predict
LinearRegression - , , . API -
scikit-learn.
( )
( ). ,
.

, .
, -

7.


metrics:

mean_squared_er1ror

sklearn.

>>> from sklearn.metrics import mean_squared_error

60

..

50

.,

. .. . . .

40
30

20
10

-10

7
5
6
8
(RM)

10

:
:
>>> mse = mean_squared_error(y, lr.predict(x))
>>> print("Kapa (
): {:.J".format(mse))
( ): 58.4

,
, r
(RMSE):
>>> rmse = np.sqrt (mse)
>>> print("RMSE ( ): {:.3)".format(rmse))
RSE ( ): 6.6

RMSE ,
, .
,
, 13 .

r ...

111111&1

, r
.
,
, RMSE
.
, ,
.

, 6.6,
. ?
- .
,
.
II .
:
1-

MSE
I' (' - jJ' )2 :::1----2

L;(Y;

-)

VAR(y)

; - i- , .( - ,
. , -
, 11 .,
.
, 1111
. ,
1, - .
, ;
, ,
.
r2_score sklearn. metr ics:
>>> from sklearn.metrics import r2_score
>>> r2 = r2_score{y, lr.predict(x))
>>> print{"R2 ( ): {:.2J".format(r2))
R2 { ): 0.31

. 1,
, ,
R.

lfil

7.

, ,
.
- 1{ ,
score LinearRegression:
>>> r2 = lr.score(x,y)


- .
11,
. (
), .
r11 , . ,
oston. ctata fit
:
>>> = boston.data
>>> = boston.target
>>> lr.fit (, )


4.7,
0.74. , , , ,
. ,
, ' 14-:., r ,
.
1111
. :
>>>
>>>
>>>
>>>
>

= lr.predict()
plt.scatter(, )
plt.xlabel (' ')
plt.ylabel(' ')
plt.plot( [ y.min(), y.max() ] , [ [ y.min() ] , [ y.max () ] ] )

,
, .
. ,
ril ( ,
, II
).

...
60
50

"'
:z:

40

30

20

...

10

-10

. ,.
.
.. ... ..
1
.

.,

10

20

30

40

50

60


,
.
. ,
. ,

. , .
, .
, scikit
learn.
Kfold
:
>>>
>>>
>>>
>>>

from sklearn.cross validation import Kfold


kf = KFold(len(x), n_folds= 5)
= np.zeros_like(y)
for train,test in kf:
lr.fit(x[train], y[train])
p[test] = lr.predict(x[test])

11111

7.

>>> rmse_cv = np.sqrt(mean_squared_error(p, ))


>>> print('RMSE 5-n : (:.2)'.format(rmse cv))
'RMSE 5-n : 5.6

11
( 11 ): s. 6. II
, ,
, ,
.

, . .
11
.
, 01111 .
.

, r
.
1111
. .
,
. ,
.
- ,

. , ,
, ,
.

-
. ,
, .
, ,
. ,
, () .

, ...

L 1 L2
. ,
; ,
scikit-learn.
(
- , -
) . ,
*.
:

-.

=argmin I ji-Xb- 1-'

,
.
, ,
,
.
, ,
:-.,
. ,
: Ll L2. Ll ,
,
L2 - .
Ll,
:

r
, ( ).
L2 :

, , :
, .
.

7.
, Lasso n1

. L 1-
Lasso', L2-
- .
.

Lasso, (
, ) 1111,
. Lasso 1,
- !
,
, .
,
, .
,
, .
, ,
( = ,
11,11 ).
:.
, Lasso
.
Lasso 11 , ,

, 1 :
, ,:

- ,
1 2 , 11
.

Lasso
scikit-learn
v

, , 111\1 1111
. scikit-lear 11
:\t ElasticNet:
Lasso llt' It'('T IIIIIOIKOJ OTIIOIIICIIII 11 11 IIOTOMY lll'1111111.
;iblipe1111aypa ll'ast abso\11tc s\11i11k(t' 111\ seleetim1 operatu1-. 1. 11.

11 , ...

111111

>>> from sklearn.linear_model import ElasticNet, Lasso


>>> en = ElasticNet(alpha = 0.5)

: en
lr. .
5.0 ( 4.6),
I< 5.4 ( 5.6).

. ,
Lt, Lasso, L2, Ridge.

Lasso
scikit-learn ,
().
,
Lasso:
>>> las = Lasso(normalize = l)
>>> alphas = np.logspace(-5, 2, 1000)
>>> alphas, coefs, _ = las.path(x, , alphas = alphas)

alphas path Lasso


, ;
. ,
.
,
:
>>> fig,ax = plt.subplots()
>>> ax.plot(alphas, coefs.T)
>>>#
>>> ax.set_xscale('log')
>>># alpha n n
>>> ax.set_xlim(alphas.max(), alphas.min())

, (
):

.
. ,
( , ),
- .
,
, . -

7.

, , ,
r .

..

:1

,.

::i:

,;s
s


-1
2
-3
10'

10

10''

10

10'

10'

10'

--N

- 1u1 ,
. 1990- , -
, ,
N,
N (
).
--N.
, ,

( ).
20 (
r ;
10 ).
, ,
, .
, .
, .
, ,

, r...

11111

, , 11 11u11
, (
, ).
- -
, .
, 01 .
,

.

,
11-.
10-,
(SEC).
,
. :
u
.
, .
16 087 . ,
, - , 150 360. ,
, , .
, i'I
, ,
.
SVMLight
, .
scikit-learn .
, SVMLight - ,
scikit-learn;
:
>>> from sklearn.datasets import load_svmlight_file
>>> data,target = load_svmlight_file('E2006.train')
ctata - (
, ),
target - npocoii .
target:

7.
>>> print('M target: ()' .format(target.min()))
target: -7.89957807347
>>> print(' target: ()' . format(target.max ()))
target: -0.51940952694
>>> print('Cpeee target: ()'.format(target.mean()))
target: -3.51405313669
>>> rint(' target: ()'.format(target.std()))
target: 0.632278353911

, -7.9 -0.5.
, ,

. ,
.
>>> from sklearn.linear model import LinearRegression
>>> lr = LinearRegression()
>>> lr.fit(data,target)
>>> pred = lr.predict(data)
>>> rmse_train = np.sqrt(mean_squared_ error(target, pred))
>>> print('RMSE : (:.2)' .format(rmse_train))
RMSE : 0.0025
>>> print('R2 J.W< : (:.2)'.format(r2_score(target, pred)I)
R2 : 1.0

- 11 , .
1.0. ,
.
(
, )
: RMSE 0.75,
-0.42.
<<.> -3.5,
!

,

, .
, . ,

.

, ...

111111D1

-
q.
, ElasticNet
0.1:
>>> f rom sklearn.linear_model import ElasticNet
>>> met = ElasticNet(alpha = O.l)
>>> kf = KFold(len (target), n _folds= 5)
>>> pred = n p.zeros_like(target)
>>> for train, test in kf:
met.fit(data[train], target[train])
pred[test] = met.predict(data[test])
>>>
RSE
>>> rmse = n p.sqrt(mean_squared_error(target, pred) )
> print(' [EN .1] R3 :J. .11J-1i( (5 :>): (: .2) ' .forrnt(rmse))
[EN 0. 1 ] RMSE ( 5 ): 0.4
>>>
>>>
>
[EN


r2 = r2_score(target, pred)
print( [EN 0.1] R2 , (5 :>): {: .2)'.forrnt(r2))
0.1) R2 (5 ): 0.61

RMSE . 4, R2 . 61 - ,
.
- .
(1. ), ().
,
, .
(
, ,
, ).
,
scikit-lear.


0.1. 0.7 23.9. ,
. ,
.
, . ,

7. r11

,
,
( ).
?
: . .
.
, ( ,
),
.
.
, :
,
. , .
.
,
. ,
. .

,,

......__
2

'

.,..

norpynn

"""'

,
. .
,
,
; ,
.
, ,
. ,
(
), <,
.

, r...

........

, .
, ,
, , , - -
.
, scikit-lca
; Lassocv, Ridgecv ElasticNetcv,

. 1 ,
:
>>> from sklearn.linear_model import ElasticNetCV
>>> met = ElasticNetCV()
>>> kf = KFold(len(target), n_folds= 5)
>>> = np.zeros_like(target)
>>> for train,test in kf:
met.t(data[train],target[train])
p[test] = met.predict(data[test])
>>> r2_cv = r2_score(target, )
>>> print(R2 ElasticNetCV: (: .2}.format(r2_cv))
R2 ElasticNetCV: 0.65

,
(,
). ,
. scikit-learn;
, n_jobs
ElasticNetcv., 4 , :
>>> met = ElasticNetCV(n_jobs= 4)
n_jobs -1, 11
:
>>> met = ElasticNetCV(n_jobs= -1)
, ,
, r ElasticNet , L1 L2.
,
ll_ratio. 1 2
v
( 1 ll_ratio):

1 =
2 =(1-)
, r
, 11_ratio -
, L1 L2.

........

7. r

, ElasticNetcv
ll_ratio, :
>>> ll_ratio = (.01, .05, .25, .5, .75, .95, .99]
>>> met = ElasticNetCV(ll_ratio = ll_ratio, n_jobs = -1)

ll_ratio .
, (
11_ratio 0.01 0.05), , Lasso (
11_ratio 0.95 0.99). ,
.

ElasticNetcv -
, -
. ,
:
ll_ratio = (.01, .05, .25, .5, .75, .95, .99]
met = ElasticNetCV(ll_ratio =ll_ratio, n_jobs = -1)
= np.zeros_like(target)
for train,test in kf:
met.fit(data[train],target[train])
p[test] = met.predict(data[test])
>>> plt.scatter(, )
>>>* )
>>>
( )
> plt.plot([p.min(), p.max ()], [p.min(), p.max ()])
>>>
>>>
>>>
>>>

ii :

-1

"'
"'
:t

"'

-2

:t

8
"'....
:t

-3
-4
-5
-
-9

-1

-6

-5

-4

-2

-1

111111

, . ,
, (, ,
).
:
scikit-learn
. .

11
-
. , -
. 1111
,
,
.
, Lasso ceei'r.
,

: , , ,
, .
ii -
,
.
scikit-learn
,
.

( scikit-learn ElasticNetcv).
- , , ,
:" .
Lasso ,
.
,
11 . r
, xopoeii .
,
.
.
.

r.nABA8.

11
. 11 1 11
(,
, ).

11i'1. ,
,
, , 111,11,
()
.
, ,
,
. ,
. ,
. 110
: .

l\1 : 1111: :.
11 , , , -
, r, ,
. 11, , - ;
. , , -r
<, ,>.
-,.

10 -.
, 111111. ; :.r,

1m1

Aazon, <., 1,11 ,


,>. n <<
,>. 11 1<11
, 11.
1<n
n 11m ,
Netllix. Netflix (

) .
DVD-, Netflix
. .
,
. Netflix
, .

, ,,
, .
2006 Netflix
, , ,
, .
. ,
n 10
, . 2009
Bel\Ko1's Pagatic Cl1aos
Nettlix . 20
, The Ensemle,
10- , - ,
.

i\)

Netflix , , ,
. ,
,
.
, ,
.
,
.
:
. -
, :
.

8.

.

.

Nettlix Prize, , ,
. , Nettlix
10% (
,
). ,
10%. 20%
.
! 20 ,
.
,
. ,
, 11
1111 .
1,
.
Groplens .
Nettlix
? : i' II
. ,
.





. : -
(, 10 ) , 1
. 1
, .
, it :
def load():
import numpy as
from scipy import sparse
data = np.loadtxt('data/ml-lOOk/u.data')


ij = data( : , : 2]
ij -= 1
# 1
values = data(:, 2)
reviews =. sparse.csc_ matrix( (values, ij. )) . astype (float)
return reviews.toarray ()

,
.
>>> reviews = load()
>>> U,M = np.where(reviews)

random

>>> import random


>>> test_idxs = np.array(random.sample(range(len(U)), len(U)//10))

train, 111 reviews ,


, , :
>>> train = reviews.copy()
>>> train[U(test_idxs], M(test_idxs]] =

test, :
>>> test = np.zeros_like(reviews)
> test[U[test_idxs], M(test_idxs]] = reviews[U(test_idxs],
M(test_idxs]]

,
,
. ,
(, ) .

, ,
,
. ,
: z-.
,
scikit-lear,
(- ,
).
,
.

8.

,
, scikit-learn API
:
class NormalizePositive(object):

. 111110
, .
, , NmPy:
def init (self, axis=O):
sel f.axis = axis
- fit.
, . ,
<< ,>:
def fit(self, features, y=None):
axis 1,
:
if self.axis == 1:
features = features.T
# ,
inary (features > )
countO = inary.sum(axis=O)


countO [ countO == ]
1.


self.mean = features.sum(axis = O)/countO
# , inary True
diff = (features - self.mean) * inary
diff **= 2
# , 0.1
self.std = np.sqrt(0.1 + diff.sum(axis= O)/countO)
return self
0.1
, ,
, .
,
.
transform
inary:

def transform(self, features):


if self.axis == 1:
features = features.T
binary = (features > )
features = features - self.mean
features /= self.std
features *= binary
if self.axis == 1:
features = features.T
return features
, , axis 1,
, ,
, .
inverse_transform :
def inverse_transform(self, features, copy=True):
if :
features = features.copy()
if self.axis == 1:
features = features.T
features *= self.std
features += self.mean
if self.axis == 1:
features = features.T
return features
, fit_trnsfrm fit transform:
def fit_transform(self, features):
return self.fit(features) .transform(features)
(fit, transform, transform_inverse
fit_transform) - , sklearn.
preprocessing.
, ,
, ,
.

:
.
: ,
,
, u .

DD1111

8.

, , ,
.
- , {1
, .
, 1,
1,, , , 1,1
. , :
, , ,
, . ,
, -
, IOl{II (, 11
1111, :-, ii, ).
t , , :
. ,
- .
ii 1, ( 1111 , u ., ,,
).
>>>
>>>
>>>
>>>
>>>
>>>
>>>

from matplotlib import pyplot as plt


,
norm = NormalizePositive(axis = l)
binary = (train > )
train = norm.fit_transform(train)
# - w. 200200
plt.imshow(binary(:200, :200], interpolation = 'nearest')

",()

l')()

l>J

lt)

()

10

1111DD

, -
. ,
, ,
.
,
. .
1.
.

( ,
).
2. (, ),
, ,
:
.
.

scipy.spatial.ctistance. pctist.
,
, ,
.
1 - r, r - . .
>>> from scipy.spatial import distance
>>>#
>>> dists = distance.pdist(inary, 'correlation')
>>># , dists[i,j] >>># binary[i] binary(j]
>>> dists = distance.squareform(dists)

:
>>> neighbors = dists.argsort(axis= l)
,
:
>>>#
>>> filled = train.copy()
>>> for u in range(filled.shape[O]):
# n_u -
n u = neighbors[u, 1:]
#-t u -
for m in range(filled.shape(l]):

........

8.

# '
revs = [train[neigh, m]
for neigh in n_u
if binary [neigh, m]]
if len(revs):
i n -
n = len(revs)
# 1
n //= 2
n += 1
revs = revs[:n]
filled[u,m] = np.mean(revs

-
, ,
. , 111
( rev [ :n J) .
, ,
n .
- .

:
>>> predicted = norm.inverse_transform(filled)

, :
>>> from sklearn import metrics
>>> r2 = metrics.r2_score(test[test > ], predicted[test > ])
>>> print('Oea R2 ( ): (:.1%}'.format(r2))
R2 ( ): 29.5%


, ,
. ,
,
, .
, :
>>> reviews = reviews
>>># , ...
>>> r2 = metrics.r2_score(test[test > ], predicted[test > ])
>>> print('Oea R2 ( -): {:.1%}'.format(r2))
R2 ( -): 29.8%

1< , . 11
11ii,
, .

1111 I


.
,
.. ,
, .
.
. ,
1:1 5- 4 ,
, .
.
, 4.3
. , 3.5,
4.
11,
- .
.

, : ,
. ,
. ,

. ,
. ,
,
( ,
, ).
train test , (
). .
:
>>> reg = ElasticNetCV(alphas = [
1

0.0125, 0.025, 0.05,

.125, .25, .5,

1., 2.,

4.J)

, ii
(, ). ii
:
>>> filled = train.copy ()
,1
,
:

8.
>>> for u in range(train.shape[O]):
curtrain = np.delete(train, u, axis =O)
# binary
bu = binary[u]
#
reg.fit(curtrain[:,bu] ., train[u, bu])
#
filled[u, -bu] = reg.predict(curtrain[:,-bu] .)

, :
>>> predicted = norm.inverse_transform(filled)
>>> r2 = metrics.r2_score(test[test > ], predicted[test > ])
>>> print('Oea R2 ( ): (:.1%) '.format(r2))
R2 ( ) : 32. ?,

:-. ,
, .

.
, ?
, , - . ,
, ,
.
, - .
,
- , .
? . , !

,
: .
( ) -
, .
,
, , -
,
. t!1 ,
:
, ,
.
,
.

1&1


r (stacked learning). ,
,
r .

r. r :
r 1
l:ltl

n
111

n111

,
. r
(
).
, ,
.
, r,
: ,
,
:
>>> train,test = load_ml100k.get_train_test(random_state = l2)
>>> t Now split the training again into two subgroups
>>> tr_train,tr_test = load_mllOOk.get_train_test(train,
random_state = 34)
>>># :
>>>#
regression.predict(tr_train)
>>> tr_predictedO
>>> tr_predictedl
regression.predict(tr_train.T) .
>>> tr_predicted2
corrneighbours.predict(tr_train)
>>> tr_predicted
corrneighbours.pred1ct(tr train.T) .
>>> tr_predicted4
norm.predict(tr_train)
norm.predict(tr_train.T) .
>>> tr_predicted5
>>>
>>> stack_tr = np.array([
tr_predictedO[tr_test > ],
tr_predictedl[tr_test > ],

11111

8.

tr_predicted2[tr_test > ],
tr_predicted[tr_test > ],
tr_predicted4[tr_test > ],
tr_predicted5[tr_test > ],
] ) .

>>>#
>>> lr = linear_model.LinearRegression()
>>> lr.fit(stack_tr, tr_test[tr_test > ])

:
>>> stack_te = np.array([
tr_predictedO.ravel(),
tr_predictedl.ravel(),
tr_predicted2.ravel(),
tr_predicted.ravel(),
tr_predicted4.ravel(),
tr_predicted5.ravel(),
] ).
>>> predicted = lr.predict(stack_te).reshape(train.shape)

, :
>>> r2 = metrics.r2_score(test[test > ], predicted[test > ])
>>> print('R2 : {:.2%}'.format(r2))
R2 : 33.15%

, r
. :
, -
.
,
r ,
.
, ,
.
,
.
,
. .
, ,
.
,
, .
.

.

, ,
.
,
.
-
. ,
, ,
. ,
, , ,
.
, ,
, -
. - Amazon.
< ,>, ,
:
Customers Who Bought Tt,is Item Also Bought

Anna Kare11ina

The Brothers Karamazov

The ldiot (Vintage Classics)

**1,'t:}W( (289)
Paperback

kkk'fd( (248)
Paperback

-(. -A:'f;dr (57)


Paperback

$10.35

$11.25

Leo Tolstoy

Fyodor D,stoevsky

Fyodor Dostoevsky

$10.

,
, ,
. , Gmail
,
, ( ,
Gmail; , ,
).
, , ,
. , , ,

, -

t1

8.

. , - ,
, 111 .

,
. ,
, ,
.
, ,
- . ,
. .
1990- ,
,
,
. ,
. , ,
, ,
( ).

, >,
-
(. zn. );
. ? 1111
,

.
, ,
( ,
50% ). ,
, ,
1111, , . ,
, ,
.
, 1 .
: <<
, ,
, , ,>.
, n, , II ,
, . ,

1111

, ii
, ,
.

:\ ,
.
( B1ijs) r .
11 ,
1, ,
.
, .

( ).
>>> from collections import defaultdict
>>> from itertools import chain
>>>#
>>> import gzip
>>>#
>>># '12 34 342 5 ...'
> dataset = [[int(tok) for tok in line.strip().split()]
for line in gzip.open('retail.dat.gz')]
>>>#
>>> dataset = [set(d) for d in dataset]
>>>#,
>>> counts = defaultdict(int)
>>> for elem in chain(*dataset):
counts[elem] += 1

:
-n'
1


2224

2-3

2438

4-7

2508

8-15

2251

16-31

2182

32-63

1940

64-127

1523

128-511

1225

512

179

1m

8.

. , 33%
. 011
1 . , ,
, 1111.111 ,
,
.
, .

, . scikit-learn
.
Apriori, (
1994 (Rakcsh Agrawal)
(Raakisla Sikant)), (
, ,
).
, Apriori
( ) ,
( , 1<
).
:
( ),
. 1, 111
,1 (, !\1 ..
- minsupport).
, . !\1
11
. ,
.

. . Aprioi ,
,
.
.
, .
:
>>> minsupport = 80

,
. Aprii -
u{1 oopoii. t


, minsupport,

:
,
:
>>> valid = set(k for k,v in counts.items()
if (v >= minsupport))

.
-
:
>>> itemsets = [frozenset([v]) for v in valid]

:
>>> freqsets = []
>>> for i in range(l):
nextsets = []
tested = set()
for it in itemsets:
for v in valid:
if v not in it:
# , v
= (it I frozenset([v]))
i ,
if in tested:
continue
tested.add()
# onopy,

........

8.

# .
# 'apriori.py' w.
support_c = sum(l for d in dataset if d.issuperset(c))
if support_c > minsupport:
nextsets.append(c)
f reqsets .extend(nextsets)
itemsets = nextsets
if not len ( itemsets) :
break
>>> print("ooo!")
!

, . ,
support_c.
,
. , .
,, , .
,
.
Apiori ,
, (
minsupport ).

.
- .
,

,
Yi.>, : "Boiiy ",
" ".>. ,
( , , ),
, ,
, Yi.> ,
<< , Yi.>, .
,
: , , II Z, , II .
, ,
.
,
11111 . .
. .1 1, ,

, , .
.
:

lift(X-? )= ( I )

()

() - , :.1 ,
( YIX) - , ,
.
. (), P(YIX)
, , ,
. ,
10, 100.
:
>>> minlift = 5.0
>>> nr_transactions = oat(len(dataset))
>>> for itemset in freqsets:
for item in itemset:
frozenset([item])
consequent
antecedent = itemset-consequent
base = .
# acount: n
acount = .
ccount :
ccount = .
for d in dataset:
if item in d: base += l
if d.issuperset(itemset): ccount +=
if d.issuperset(antecedent): acount += 1
base /= nr_transactions
p_y_given_x = ccount/acount
lift = p_y_given_x / base
if lift > minlift:
print('Y n () -> (1) (2)'
.format(antecedent, consequent,lift))

.
- , ,
,
( ), ,
( ), ,
.

EE':J
1

1378, 1379, 1380 1269

8.

279(0.3%)

80

57

225

48,41,976

117

1026 (1.1%) 122

51

35

48,41,1,6011

16010

1316(1.5%) 165

159

64

, , 80 1378, 1379 11 1380


. , 57
1269, 57/80::::: 71%.
, 0.3% 1269,
255.
- ,
,
.
, ,

.
, 11
; 1030 (
80 , 5).
, .
,
, .

,
r.

,
ii. ,
,
100 . 1r,
. , ,
, r
.
11r r
11, , -

:DI

. , , ,
, ,
.
.
, 1 , .

11
. , ,
11 ,
. ,
, .
,
;
.

:
.
() ,
, ,>.
, ,
. scikit-learn
,
.
,
(
?). ,
,
r .

: . r :\I
: .
- .

r.nABA 9.
ll11
no 1.n
1, :
I-Iaop 011, .
, I-Iaope Iis ,
11 . ,
,
u
111 , , 1111
.
u
11 -. 1 111111 011;1I-111 I-I.
? , J -1111?
, , :.1 ,
- << >
. 11 , -
111,
.


, ,
, I-Ili, .
11, ,
. 110
, :-.1
ii. II
1111.
, 111111 -ii )"Ii\
, 1 .
- 111 11:1.1 :
, 1, 11, , , 11 . .


GTZAN,
r
. 10 ,
: , , 11, , .
30 100
. ,1 http: //opihi.
cs.uvic.ca/sound/genres.tar.gz.

"h
'-'()


22 050 (22 050 )
WAV.

WAV
,
-,
11. , -
, ,
. ,
10 .
.
WV-,
scipy.io. wavfile.
-,
.
,
SoX ht tp: //sox.sourceforge.net.
,
.

WAV -
J\111 SciPy:
>>> sample_rate, = scipy.io.wavfile.read(wave_filename)

, sample_rate - ,
. ,
.

9. 1

, <<
,> , -
. -
.
, - . ,
.
matplotlib spegram ,
:
>>> import scipy
>>> from matplotlib.pyplot import specgram
>>> sample_rate, = scipy.io.wavfile.read(wave_filename)
>>> print sample_rate, X.shape
22050, (661794,)
>>> specgram(X, Fs=sample rate, xextent=(0,30))
WV-
22 050 661 794 .
30
WV-, l\<t ,
.
classical song 1

classical song 2
classical song
12
11,
lOk:
101,
IH
61(
6k
6
4v
4'
4k
'or...:....:....i.....1..u111,;;:ii.'11!..i...;;;.:J1 2aWIIIW"a......,.....:.;_.;.._...., ,,._.._......-............,1..
10 15 20 25 30

JO 15 20 15 30
10 1 25 .30 O
jazz song 2
jazz song 1
jazz song
12 k
12<
10 k. 1
I
8 k.
6k
12

'U
6k

<

igt
.,,.
j

t ___.,....._,.,..._L ,._., ..1._..a

-!11:iU::::lf.J..i.::Jf

.._

_...

5
10 15 20 25 30
5
10 20 25 30

5
10 15 20 25
country song
country song 1
country song 2
",
12,
12,
10 k
lOk.
lOk '
R
8'
8],
61,;
6k
IJ k I
lr
_.lr
41.
'11.:........i::....._;;;..i 25 ...................__.................i:i
_.. 10

10 15 )(.1 :!5 iO

:?5
:? 25
song 1
song
1:12k
10 k
10 < ... 1
Bk
t,
6'
4J...,
r [! . .. \ 1 .,.\
wt :.......:.. -- - :..ii.:.:.:....1..;;.: 1...1, 2k,

5 10 1 S 20 25 ]
rock song 1
12,
12 k
lOk
Hl 1<
I
k
1
6\;
4\(
4k
2 .................................... .....-.-.................__...,

10 1 2Q 25

10 1 1/J 25 JO
metal song 2
metal song 1
10
8
6k
81(
6
2rilll:i..""":.......llr.;.......a..;J""";,;:i
!tJ J5 20 2"1 )

10 15 10 2. 30

1t

lH

't ................_................._.......

.'

, ...

"

t .,

...............................

30

1ED1

, 1(
, , .
,;,>
( ,
!),
.
,
, .
, , .
, ,
. .

,
( )
.
(). , ,
, ,
.
, .
, , WV-, sine_a.wav sine_b.wav,
11 400 3000 .
ii ,;i1
SoX:
$ sox --null -r 22050 sine_a.wav synth 0.2 sine 400
$ sox --null -r 22050 sine_b.wav syr.th 0.2 sine 3000

0.008 .
- . ,
400 3000 .
, <1 400
, 3000 - :
$ sox --comine mix --volume 1 sine_b.wav --volume 0.5 sine_a.wav
sine mix.wav

r ,
, 3000 ,
400 .


le9
2_0
1.5
1.0
0.5
.
-0.5
-1.0
-1.5
-2.0
.

3_5
.
2.5
2.0
1.5
1.0
0.5
.

2_0
1.5
1.0
0.5

0.002

9.

400Hz sine wave

0.004

0.008

0.006

t,me [s]

FF of 400hz sine wave

lel2

1000

3000

2000

freq:>ncy [ Hz J

4000

,z sine wave

le9

-0.5

-1.0
-1.5
-2.0
.

3.5
3.0
2.5
2.0
1.5
1.0
0.5

0.002

0008

0.006

t,me [s]

FF of 3,000hz sine wave

112

0.004

1000

2000

3000

frequency [Hz]

4000

Mixed sine wave

-1

-2
-3

.S
3.0
2.5
2.0
1.5
1.0
0.5
.

0.002

0.008

0.006

t1me [s]

FF of mixed sine wave

lel2

0004

lCOO

20

3000

f1eq,1e11cy [Hz]

4000

111111

...

, 11 ,
.
some sample song

6000
4000
2000

-4000
-6000
.

0.002

. le7

0.004

t1111e [s)

0.008

FF of some san1ple song

2.5
2.0
1.5
1.0
0.5

. --------------1000

3000
2000
freque11cy [Hz]

40UO

,
.
- ,
,
.

,
, .
- ,>,
, .
,
.
,
. -

........

9. 1

, WV
. create_fft(),
scipy.fft() . ( !)
1000 . ,
, ,
; ,

.
, .
import os
import scipy
def create_fft(fn):
sample_rate, = scipy.io.wavfile.read(fn)
fft_features = abs(scipy.fft(X) [:1000])
base_fn, ext = os.path.splitext(fn)
data- fn = s - fn + ".fft"
scipy.save(data_fn, fft_features)
N save ,
.npy.
WAV -,
.
read_fft():
import glob
def read_fft(genre_list, base_dir= GENRE_DIR):

[]
[]

for label, genre in enumerate(genre_list) :


genre_dir = os.path.join(base_dir, genre, "*.fft.npy")
file_list = glob.glob(genre_dir)
for fn in file list:
fft_features = scipy.load(fn)
X.append(fft_features[:1000])
.append(label)
return np.array(X) , np.array(y)
,
:
genre_list = ["classical", "jazz", "country", "", "rock", "metal"]

...


,1
, u 6. ,
- ii
rii 11111,
.
,
.
, 50% -
, ' 1
. rii 50%
. 6 ,
16.7% ( ,
).



r
ru ,
. ,
.
, :
>>> from sklearn.metrics import confusion_matrix
= confusion_matrix(y_test, y_pred)
>>> print(cm)
((26 1 2 2)
[ 4 7 5 5 3)
[ 1 2 14 2 8 3)
[ 5 4 7 3 7 5)
[ 10 2 10 12]
[ 1 4 13 12)]
>>>

,
.
. ,
6 6. , 31
( ) 26
, 1 - , 2 - 2 -
.
. , 26 ,

9. 11 1

5 - . . :
24 7 29%.
, 1,
, ,
.
,
( ) 1 (
).
,
NttPy. matshow
atplotlib:
from matplotlib import pylab
def plot_confusion_matrix(cm, genre_list, name, title):
pylab.clf()
pylab.matshow(cm, fignum= False, cmap= 'Blues',
vmin= O, vrnax= l.O)
= pylab.axes()
ax.set_xticks(range(len(genre_list)))
ax.set_xticklabels(genre_list)
ax.xaxis.set_ticks_position("bottom")
ax.set_yticks(range(len(genre_list)))
ax.set_yticklabels(genre_list)
pylab.title(title)
pylab.colorbar()
pylab.grid(False)
pylab.xlabel('Predicted class')
pylab.ylabel('True class')
pylab.grid(False)
pylab.show ()

( cmap matshow ()),
, ,
.
,
j et ired.

l\l :

...

1., 1

:;i
,:,:
:;;
:,:
:,:
:,:

t;
:s:

co,intr1
pc::i

rod\

..!

111ClI

c:<1ss1cal jazz coutry

roc"

rntal

1.i


,
, - .
,
.
( ). , , , ,
.
, -
( - ),
1 . ,
(
1000). ,
. ,
, .
,
.

.

m1

9.

,
. , ,
- (/).
/
(),
, .
, /
,
, ,
, . -
. ,

.
,

(. ).
/0 .(U=.!)) / 1

KpIIJle.PXf\ (U=.0,68,) /_ ..11.1i _

,,
.....
u
:,:

,,

,,

,,

1 /.
:
v
, 11 1 (1 ,
(AUC) 1.0.
1< .
i\ 111

1111

neporo...

.
: ,
. ,
, ,
( ),
AUC 0.5. AUC
/ .
,

/ , ,
. .
Davis,
Goadrich << Tl1e Relationship Between Precisio-Recall d ROC Curves,>
(ICML, 2006).
/ .

--
TP + FN

FPR=
FP+TN

=
TP+FP
TPR=
TP+TN

,
(TPR)
/.
(FPR)
,
, (
), - .
, ,
,
.

,
. ,
- / - ,
.
, -

1m

9.

, "11111 11 1111
.
from sklearn.metrics import roc_curve
y_pred = clf.predict(X_test)
for label in labels:
y_label_test = scipy.asarray(y_test = =label, dtype = int)
proba = clf.predict_proba(X_test)
proba_label = proba[:,label]
# :n ,
#
fpr, tpr, roc_thres = roc_curve(y_label_test, proba_label)
# tpr fpr ...

)"t , 1111
. 11,
.
, :-.
. 11
- ,
.

, - 111 111,
,
1111
. - .
1111, ,
. , -
II 11, . . ii
, , r
. :\-1 , 11

w ...
(nlF.93)/n

,
,,

11111 IE!IJ

(n=.73)/

,,

,,

,,

,'

(lF.68)/

,,

,,

,'

(n=.26)/

,'

',
''

,,
',

,,
,'
n

(lF.57)/

,'

,'

,,

(=.61)/

,'

',

,'

,,

(lnternational Society for Music Infoation Retieval, ISMIR).


,
(Autoatic Music Genre Classification, AMGC) -
.
AMGC, ,
.

m1

9.

,
- .
(Mel Frequency Cepstrn, MFC) -
,
, {1 .
.
, , <<> -
<'>. MFC
. ,
.
, , - ii
Talkbox SciKit.
https://pypi.python.org/pypi/
scikits.talkbox. mfcc(),
MFC:
>>>
>>>
>>>
>>>

from scikits.talkbox.features import mfcc


sample_rate, = scipy.io.wavfile.read(fn)
ceps, mspec, spec = mfcc(X)
print(ceps.shape)

(4135, 13)

ceps,
13 (
nceps mfcc <)) 4135
fn.
.
. ,
,
, 10
.
= np.mean(ceps[int(num_ceps*O.l) :int(num_ceps*0.9) ], axis = O)

, , ,
30 ,
10 .
- ,
.
, ,
MFC,
.
ik :

...

11111 JI

def write_ceps(ceps, fn):


base_fn, ext = os.path.splitext(fn)
data_fn = base_fn + .ceps
np.save(data_fn, ceps)
print(Written to %s % data fn)
def create_ceps(fn):
sample_rate, = scipy.io.wavfile.read(fn)
ceps, mspec, spec = mfcc(X)
write_ceps(ceps, fn)
def read_ceps(genre_list, base_dir = GENRE_DIR):

[]'

[]

for label, genre in enumerate(genre_list):


for fn in glob.glob(os.path.join(
base_dir, genre, *.ceps.npy)):
ceps = np.load(fn)
num_ceps = len(ceps)
X.append(np.mean(
ceps[int(num_ceps*O.l):int(num_ceps*0.9)], axis= O))
.append(label)
return np.array(X), np.array(y)


, 13
(. ).
. ,
-
1.0.
. r,
,
.
(. ).
,
, :
. , ,

. ,

,
. - ISMIR -

(Auditory Filterbak Temporal Envelope, AFTE), ,
, MFC .
, ?

tm

9.

(=.99)/

(=.83)/

,'

,'

,'

,,

,,

,'
',

,,

,,

(=.87)/

,'

,,

,
,,

,,

(=.97)/

<

....

....

"

''

>-

iz..,!

>-

..

z ...

8.

8.

,'

(=.86)/

'

,,

'

(=.99)/

,,
,'

,'
,,

,,

,'

,,

'

,,

,'
.n

,'


CEPS

:;:

,:,:
:;;
:,:
:,:
:,:
u
:s:

...

.11,

C,Jr'\I)

rock
,ctal

'1

<..1,.t$iCal

J.:tZZ

L-: !ry'

roci<.

nctal

, , ,
,
, , .
, 1,
, . , ,
, ,
-
, ,
.


. ,
,
, .
MFC
.
, ,
, .

9.

, - . 11 ,
, 1111
.
.
, r - ,
.

,
.
11,
.
ahotas n-
.

rJIABA 10.

u II ;1
.
11
,
. , u
,
.
, 13

u. 1,
111altas 1111 .

11.
. ,
-
.
, 13 1111
. ; it , {,
, SIFT (scale-invariant t"eatre trast"on -11111 ; ),
1999 . u
.

, -
. -
,
.

m1

1 .

- , 0110
, PNG JPEG.
, PNG
, JPEG - , 110
.
(,
).
.

( - ) .
111111 ,
11,
.



ahotas. https: ! /pypi.python.
org /pypi ! , - - http: ! /mahotas.
readthedocs.org. Mahotas - (
MIT,
) , .
, NuPy. N,
, .
;-.1, scikit-i111age (ski111age),
ndiage (- ) SciPy Python
OpenCV. Nu111Py,
11 1
, 1-i'l.
altas,
mh, :
>>> import mahotas as mh
11.1 1111 imreact:
>>> image = mh.imread('sceneOO.jpg')
sceneoo .jpg (
1r 11111)
h II w;

1111

( h, w, ). u - , - ,
- : , .
,
N .
np.uint8 (8- ).
,
.
, -
, :
( n-
), 12- 16-. Mahotas
, .

,
.
Mahotas -
. ,
( , ).
PNG JPEG .
, ,
,
mahotas.

,
atplotlib, :
>>> from matplotlib import pyplot as plt
>>> plt.imshow(image)
>>> plt.show()

, ,
- , - .
ru-. Python

: mahotas NttPy,
matplotlib; ,
scikit-learn.

n 1 .

- : ,
, ,
- (, ,
rue False). ,
. mahotas
. Otsu,
. ,
rgb2gray
mahotas.colors.
rgb2gray
, , image.
mean(2). , rgb2gray

.
.
>>> image = mh.colors.rgb2grey(image, dtype= np.uint8)
>>> plt.imshow(image) #

matplotlib
:

&111

, - .
. :
> plt.gray()

. ,
,
.
:
>>> thresh = mh.thresholding.otsu(image)
>>> print('opo {}. '.format(thresh))
138.
>>> plt.imshow(image > thresh)


138, :

, , ,
II
. mahotas :
> iml = mh.gaussian_filter(image, 16)

,
,

m1

1 .

,
. gaussian_filter -
( ). ,
( . ):

,
:

m1

, , NumPy
, .
:
>>> im = mh.demos.load('lena')

,
:
>>> r,g,b = im.transpose(2,0,l)


mh.as_rgb.
, ,
8- ,
RG-:
>>>
>>>
>
>>>

rl2 = mh.gaussian_filter(r,
gl2 = mh.gaussian_filter(g,
2 = mh.gaussian_filter(,
iml2 = mh.as_rgb(rl2, gl2,

12.)
12.)
12.)
2)

.
w,
- :

m1

1 .

>>> h, w = r.shape II
>>> , = np.mgrid[:h, :w]

np.mgrid,
(h, w>,
. :
>>>
>>>

- h/2. 11 h/2
/ Y.max() # -1 .. +1

>>>
>>>

- w/2.
/ X.max ()

,
:
>>> = np.exp(-2.*(**2+ **2))
>>> # : ..1
>>>
- C.min()
>>> = / C.ptp()
>>> = [ : , : , None] #

, - NumPy,
mahotas . , ,
,
:
>>> ringed = mh.stretch(im*C + (l-C)*iml2)

m1

,
. : ,
. 30
,
. ,
, .

GitHub. ,
, .

( ).
- . ,
,
, . ,
(
) . ,

.
<$ >,
7 <,,>.
.
, , ,
.
.
.
. ,
(
).
.
.

1 .

Mahotas .
mahotas. features.
Haralick.
,
. ,
111-1 ,
. ahotas
:
>>> haralick_features = mh.features.haralick(image)
>>> haralick_features_mean = np.mean(haralick_features, axis= O)
>>> haralick_features_all = np.ravel(haralick_features)

mh.features.haralick 4 13.
- ,
(, II ).
,
(
haralick_features_mean).
( haralick_features_a11).
. ,
II
, haralick_features_all.
mahotas .
,
. , ,
11 .
, -
,
:
>>>
>>>
>>>
>>>
>>>

from glob import glob


images = glob('SimpleimageDataset/*.jpg')
features = []
labels = []
for im in images:
labels.append(im[:-len('OO.jpg')])
im = mh.imread(im)
im = mh.colors.rgb2gray(im, dtype = np.uint8)
features.append(mh.features.haralick(im).ravel())

>>> features = np.array(features)


>>> labels = np.array(labels)

111111

.
,
( ,
).
.

. , ,
, .
r
:
>>>
>>>
>>>
>>>

from sklearn.pipeline import Pipeline


from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
clf = Pipeline([('preproc', StandardScaler()),
('classifier', LogisticRegression())J)

,
:
>>> from sklearn import cross_validation
>>> cv = cross_validation.LeaveOneOut(len(images))
>>> scores = cross_validation.cross_val_score(
clf, features, labels, cv = cv)
>>> print(': (: .1%}'.format(scores.mean ()))
: 81.1%

81% - (
33%). ,
.

. ,
.
. ,
,
. ,

.
.
, <1
- .
, altas, -
. , .

m1

1 . w

,
, ,
.
RGB,
: R (), G () (
). - 8- ,
17 . 64,
. ,
:
def chist(im):

,
64 (1:
im = im // 64

3,
64 .
ii, :
r,g,b = im.transpose((2,0,l))
pixels = 1 * r + 4 * + 16 * g
hist = np.bincount(pixels.ravel(), minlength=64)
hist = hist.astype(float)

ii ,
. , ,
. np. loglp,
log(h+1).
( , NumPy
).
hist = np.loglp(hist)
return hist

:
>>> features = []
>>> for im in images:
image = mh.imread(im)
features.append(chist(im))

, ,
90%. 1, 11
:
>>> features = []
>>> for im in images:


imcolor = mh.imread(im)
im = mh.colors.rgb2gray(imcolor, dtype = np.uint8)
features.append(np.concatenate([
mh.features.haralick(im) .ravel(),
chist (imcolor),
] ) )

u 95.6%:
>>> scores = cross_validation.cross_val_score(
clf, features, labels, cv = cv)
>>> print(': (: .1%}' .format (scores.mean()))
: 95.6%

, -
.
,
scikit-learn. - -
.
.

,
,
11. ,
, (
).
11 , ,
:
. , - 611
.
, .
,
. ,
, ,
.
,
.
:
>>> features = []
>>> for im in images:
imcolor = mh.imread (im)

1 .
, 200
t
imcolor = imcolor[200:-200, 200:-200)
im = mh.colors.rgb2gray(imcolor, dtype=p.uit8)
features.apped(p.concatenate([
mh.features.haralick(im).ravel(),
chist(imcolor),
] ))

>>>
>>>
>>>
>>>

sc = StandardScaler()
features = sc.fit_transform(features)
from scipy.spatial import distance
dists = distance.squareform(distance.pdist(features))

(
), ,
>> :

>>> fig, axes = plt.subplots(2, 9)


>>> for ci,i in enumerate(range(0,90,10)):
left = images[i]
dists_left = dists[i]
right = dists_left.argsort()
right[O] - , left[i], t> ceyl(llIOi
right = right[l]
right = images[right]
left = mh.imread(left)
right = mh.imread(right)
axes[O, ci] .imshow(left)
axes[l, ci] .imshow(right)
:

, ,
, , ,
. , ,
, .

11
,i. , ,
, .
,
.
,
: ,
: , ,
.
, ,
. ,
.
.
:

- :

1111

: 1 .

,
. ,
. -,
1. 11 scikit-lear
r
. 1110 = 1. , 1
.
>>> from sklearn.grid_search import GridSearchCV
>>> C_range = 10.0 ** np.arange(-4, 3)
>>> grid = GridSearchCV(LogisticRegression(),
...                     param_grid={'C': C_range})
>>> clf = Pipeline([('preproc', StandardScaler()),
...                 ('classifier', grid)])

The data is not organized in random order: similar images are close to each other in the file listing. Therefore, we use a cross-validation schedule that shuffles the data:
>>> cv = cross_validation.KFold(len(features), 5,
...                             shuffle=True, random_state=123)
>>> scores = cross_validation.cross_val_score(
...     clf, features, labels, cv=cv)
>>> print('Accuracy: {:.1%}'.format(scores.mean()))
Accuracy: 72.1%

This is reasonable for four classes, but we will now see whether we can do better with a different set of features. In fact, we will see that we need to combine these features with other methods to get the best possible results.


Local feature representations

A relatively recent development in computer vision is the advent of local-feature based methods. Local features are computed on a small region of the image, unlike the features we considered previously, which were computed over the whole image. mahotas supports one type of these features, SURF (Speeded Up Robust Features); there are several others, the best known being the original SIFT proposal. These features are designed to be robust against rotation and illumination changes (that is, their values change only slightly when the illumination changes).

When using these features, we have to decide where to compute them. Three possibilities are commonly used:
- randomly;
- on a grid;
- by detecting interesting areas of the image (a technique known as keypoint or interest point detection).

All of these are valid and, under the right circumstances, give good results; mahotas supports all three. Interest point detection works best if you have reason to expect that the interest points will correspond to areas of importance in the image.

We will use the interest point method. Computing the features with mahotas is easy: import the right submodule and call the surf.surf function:
>>> from mahotas.features import surf
>>> image = mh.demos.load('lena')
>>> image = mh.colors.rgb2gray(image, dtype=np.uint8)
>>> descriptors = surf.surf(image, descriptor_only=True)

The descriptor_only=True flag means that we are only interested in the descriptors themselves and not in their pixel location, size, or orientation. Alternatively, we could have used dense sampling with the surf.dense function:
>>> from mahotas.features import surf
>>> descriptors = surf.dense(image, spacing=16)

This returns descriptors computed at points spaced 16 pixels apart. Since the positions of the points are fixed, the meta-information on the interest points is not very interesting and is not returned by default. In either case, the result is an n-by-64 array, where n is the number of sampled points. The number of points depends on the size of the images, their content, and the parameters passed to the functions. With the default settings, we obtain a few hundred descriptors per image.

We cannot directly feed these descriptors into a logistic regression, a support vector machine, or a similar classification system, because each image yields a different number of them. There are several solutions for using the descriptors of an image. We could simply average them, but the results of doing so are not very good, as averaging throws away all location-specific information; we would just obtain another global feature set based on edge measurements.

The solution we use here is the bag of words model. It was first published in this form in 2004 and is one of those obvious-in-hindsight ideas: it is very simple to implement and achieves very good results.

It may seem strange to speak of words when dealing with images. It is easier to understand if you imagine that you have not written words, which are easy to tell apart, but spoken audio. Every time a word is spoken it sounds slightly different, and different speakers have their own pronunciation, so a word's waveform will not be identical across utterances. However, by clustering these waveforms we can hope to recover most of the structure, so that all instances of a given word end up in the same cluster. Even if the process is imperfect (and it will be), we can still speak of grouping the waveforms into words.

We perform the same operation with image data: we cluster together similar-looking regions from all the images and call the resulting clusters visual words.

The number of words used does not usually have a big impact on the final performance of the algorithm. Naturally, if it is extremely small (10 or 20 words for a few thousand images), the overall system will not perform well. Similarly, if there are too many words (many more than the number of images, for example), the system will also not perform well. In between these extremes, however, there is often a large plateau where the number of words can be chosen without much impact on the result. As a rule of thumb, values such as 256, 512, or 1,024 work well if you have many images.

We start by computing the descriptors for all the images:
>>> alldescriptors = []
>>> for im in images:
...     im = mh.imread(im, as_grey=True)
...     im = im.astype(np.uint8)
...     alldescriptors.append(surf.dense(im, spacing=16))
>>> # get all descriptors into a single array
>>> concatenated = np.concatenate(alldescriptors)
>>> print('Number of descriptors: {}'.format(len(concatenated)))
Number of descriptors: 2489031

This results in over two million local descriptors. We now use k-means clustering to obtain the centroids. We could use all the descriptors, but for speed we work with a smaller sample:
>>> # use only every 64th vector
>>> concatenated = concatenated[::64]
>>> from sklearn.cluster import KMeans
>>> k = 256
>>> km = KMeans(k)
>>> km.fit(concatenated)

After this is done (it will take a while), the km object contains information about the centroids. We now go back to the descriptors and build the feature vectors:
>>> sfeatures = []
>>> for d in alldescriptors:
...     c = km.predict(d)
...     sfeatures.append(
...         np.array([np.sum(c == ci) for ci in range(k)]))
>>> # build single array and convert to float
>>> sfeatures = np.array(sfeatures, dtype=float)

The end result of this loop is that sfeatures[fi, fj] is the number of times image fi contains visual word fj. The same could be computed faster with the np.histogram function, although getting its arguments exactly right is a little tricky. We convert the result to floating point because we do not want integer arithmetic (with its rounding semantics).
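For reference, here is what the np.histogram variant could look like (a sketch, not the book's code; note that the bins argument must cover all k cluster indices):
>>> sfeatures = []
>>> for d in alldescriptors:
...     c = km.predict(d)
...     hist, _ = np.histogram(c, bins=np.arange(k + 1))
...     sfeatures.append(hist)
>>> sfeatures = np.array(sfeatures, dtype=float)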

Each image is now represented by a single feature array of the same length (the number of clusters, 256 in our case), so we can apply our standard classification method:
>>> scores = cross_validation.cross_val_score(
...     clf, sfeatures, labels, cv=cv)
>>> print('Accuracy: {:.1%}'.format(scores.mean()))
Accuracy: 62.6%

This is worse than before! Have we gained nothing?

In fact, we have: we can combine these features with the color histogram features and obtain 76.1 percent accuracy:
>>> combined = np.hstack([features, sfeatures])
>>> scores = cross_validation.cross_val_score(
...     clf, combined, labels, cv=cv)
>>> print('Accuracy: {:.1%}'.format(scores.mean()))
Accuracy: 76.1%

This is the best result we have obtained, better than any single feature set alone. It works because the local SURF features are different enough from the color histogram features to add new information and improve the combined result.

Summary

We learned the classical feature-based approach to handling images in a machine learning context: by reducing a million pixels to a few numeric features, we were able to use a logistic regression classifier directly, and all the techniques we learned in the other chapters suddenly became applicable to image problems. We saw one example in the use of image features to find similar images in a dataset.

We also learned how to use local features in a bag of words model for classification. This is a very modern approach to computer vision that achieves good results while being robust to many irrelevant aspects of the image, such as illumination, even uneven illumination within the same image. We also used clustering as a useful intermediate step in classification rather than as an end in itself.

We focused on mahotas, one of the major computer vision libraries in Python. There are others that are equally well maintained. Skimage (scikit-image) is similar in spirit but has a different set of features. OpenCV is a very good C++ library with a Python interface. All of them can work with NumPy arrays, and you can mix and match functions from different libraries to build complex computer vision pipelines.

In the next chapter, you will learn a different form of machine learning: dimensionality reduction. As we saw in several earlier chapters, including this one, it is computationally easy to generate many features; however, we often want a reduced number of features for speed, for visualization, or to improve our results. The next chapter shows how to achieve this.

Chapter 11. Dimensionality reduction

Garbage in, garbage out: throughout the book we have seen that this pattern holds when applying machine learning methods to training data. Looking back, we realize that the most interesting machine learning challenges always involved some sort of feature engineering, where we tried to use our insight into the problem to carefully craft additional features that the learner would hopefully pick up.

In this chapter we go in the opposite direction with dimensionality reduction, cutting away features that are irrelevant or redundant. Removing features may seem counter-intuitive at first, since more information should always be better than less. Moreover, even if a dataset contained redundant features, would the learning algorithm not quickly figure this out and set their weights to zero? There are, nevertheless, several good practical reasons to trim the dimensionality as much as possible:
- Superfluous features can mislead or confuse the learner. This is not the case for all methods (support vector machines, for example, love high-dimensional spaces), but most models feel safer with fewer dimensions.
- More features mean more parameters to tune and a higher risk of overfitting.
- The data we gathered to solve our task may have artificially high dimensionality, while its real dimension may be small.
- Fewer dimensions mean faster training, which means more parameter variations can be tried in the same time frame, which means a better end result.
- If we want to visualize the data, we are restricted to two or three dimensions.

So here we will show how to get rid of the garbage in our data while keeping the truly valuable part of it.

Sketching our roadmap

Dimensionality reduction can be roughly grouped into feature selection and feature extraction methods. We have already employed some kind of feature selection in almost every chapter, whenever we invented, analyzed, and then perhaps dropped features. In this chapter we present ways to use statistical methods, namely correlation and mutual information, to do feature selection in vast feature spaces. Feature extraction, on the other hand, tries to transform the original feature space into a lower-dimensional one. This is especially useful when we cannot get rid of features using selection methods, but still have too many features for our learner. We will demonstrate it using principal component analysis (PCA), linear discriminant analysis (LDA), and multidimensional scaling (MDS).


Selecting features

If we want to be nice to our machine learning algorithm, we should provide it with features that do not depend on each other, yet are highly dependent on the value to be predicted. That way, each feature adds salient information, and removing any of them leads to a drop in performance.

If we have only a handful of features, we can draw a matrix of scatter plots, one plot for every pair of features, and easily spot relationships between them. For every pair showing an obvious dependence, we would then consider whether to remove one of the two or to design a newer, cleaner feature out of both.

Most of the time, however, we have far more than a handful of features to choose from. Just think of the classification task where we had a bag of words to classify the quality of an answer: it would require a 1000-by-1000 matrix of scatter plots. In this case we need a more automated way to detect overlapping features and to resolve the overlap. We present two general approaches in the following subsections: filters and wrappers.

Detecting redundant features using filters

Filters try to clean up the feature set independently of any machine learning method used afterwards. They rely on statistical methods to find which features are redundant or irrelevant. Of a group of redundant features, only one is kept; irrelevant features are simply removed. In general, a filter works as depicted in the following workflow:


[Figure: the filter workflow; the full feature set x1, x2, ..., xN is first stripped of irrelevant features, then of redundant ones, and the resulting subset is handed to the learner.]

Correlation

Using correlation, we can easily see linear relationships between pairs of features. In the following graphs, we show different degrees of correlation, together with a potential linear dependency plotted as a dashed line (a fitted one-dimensional polynomial). The correlation coefficient Cor(X1, X2) at the top of each graph is calculated with the common Pearson correlation coefficient, by means of the pearsonr() function of scipy.stats.

Given two data series of equal size, this function returns a tuple of the correlation coefficient and the p-value. The p-value describes how likely it is that the series were generated by an uncorrelated system; in other words, the higher the p-value, the less we should trust the correlation coefficient:
>>> from scipy.stats import pearsonr
>>> pearsonr([1,2,3], [1,2,3.1])
(0.99962228516121843, 0.017498096813278487)
>>> pearsonr([1,2,3], [1,20,6])
(0.25383654128340477, 0.83661493668227405)

In the first case, we have a clear indication that both series are correlated. In the second case, we still get a clearly non-zero correlation value; however, the p-value of 0.84 tells us that the correlation coefficient is not significant, and we should not pay it too much attention.
[Figure: scatter plots of feature pairs with fitted dashed lines, showing decreasing degrees of linear correlation, for example Cor(X1, X2) = 0.999, 0.787, and 0.070.]

In the first cases, which have high correlation coefficients, we would probably want to throw out either X1 or X2, because they seem to convey similar, if not the same, information. In the last case, however, we should keep both features. In our application, this decision would of course be driven by the p-value.

Although it worked nicely in this example, reality is seldom so kind to us. One big disadvantage of correlation-based feature selection is that it only detects linear relationships (relationships that can be modeled by a straight line). The problem becomes apparent if we apply correlation to nonlinear data. In the following example, the features have a quadratic relationship:
[Figure: scatter plots of quadratic feature relationships; despite the clear dependence, the correlation coefficients are near zero, for example Cor(X1, X2) = -0.078, -0.064, and -0.071.]

Even though the human eye immediately sees the relationship between X1 and X2 in almost all of these plots, the correlation coefficient does not. Correlation is clearly useful for detecting linear relationships, but fails for everything else. Sometimes, it helps to apply simple transformations to obtain a linear relationship; for instance, in the preceding plots we would have gotten a high correlation coefficient had we plotted X2 over X1 squared. Ordinary data, however, seldom offers this opportunity.
Fortunately, for nonlinear relationships, mutual information comes to the rescue.

Mutual information

When doing feature selection, we should not focus on the type of relationship, as we did in the previous section with linear ones. Instead, we should think in terms of how much information one feature provides, given that we already have another.

To understand this, suppose we want to use the features house_size, number_of_levels, and avg_rent_price to train a classifier that outputs whether a house has an elevator. Intuitively, once we know house_size, we do not need number_of_levels anymore, since it carries largely redundant information. With avg_rent_price it is different: we cannot infer the rental price simply from the size of the house or the number of levels it has. Thus, it is wise to keep only one of the first two features, in addition to the average rental price.

Mutual information formalizes this reasoning by computing how much information two features have in common. Unlike correlation, however, it relies not on the sequence of the data, but on its distribution. To understand how it works, we have to take a short dive into information entropy.

Let us assume we have a fair coin. Before flipping it, our uncertainty as to whether it will show heads or tails is maximal, as both outcomes have an equal probability of 50 percent. This uncertainty can be measured by Claude Shannon's information entropy:

H(X) = -\sum_{i=1}^{n} p(X_i) \log_2 p(X_i)

In our fair coin case, we have two outcomes: let X_0 be heads and X_1 be tails, with p(X_0) = p(X_1) = 0.5. Thus we get:

H(X) = -p(X_0)\log_2 p(X_0) - p(X_1)\log_2 p(X_1) = -0.5\log_2 0.5 - 0.5\log_2 0.5 = 1.0

For convenience, we can also use scipy.stats.entropy([0.5, 0.5], base=2). We set the base parameter to 2 to obtain the same result as above; otherwise, the function uses the natural logarithm via np.log(). In general, the base does not matter, as long as you use it consistently.

Now imagine that we know upfront that the coin is not actually fair, and heads have a 60 percent chance of showing up after a flip:

H(X) = -0.6\log_2 0.6 - 0.4\log_2 0.4 = 0.97

This situation is less uncertain. The uncertainty decreases the farther we get from 0.5, reaching the extreme value of 0 when the probability of heads is either 0 or 100 percent, as the following graph shows.
.
[Figure: the entropy H(X) as a function of P(X = heads); the curve peaks at 1.0 for P = 0.5 and falls to 0 at P = 0 and P = 1.]
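A quick numeric check with SciPy reproduces the two coin values above (a minimal sketch):
>>> from scipy.stats import entropy
>>> entropy([0.5, 0.5], base=2)
1.0
>>> entropy([0.6, 0.4], base=2)
0.97095059445466858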

We will now modify the entropy H(X) by applying it to two features instead of one, in such a way that it measures how much uncertainty about X is removed when we learn about Y. Then we can grasp how one feature reduces the uncertainty of another.

For example, without any further information about the weather, we are totally uncertain whether it is raining outside. If we now learn that the grass outside is wet, the uncertainty is reduced (though we would still have to check whether the sprinkler had been turned on).

More formally, mutual information is defined as:

I(X;Y) = \sum_{i=1}^{m} \sum_{j=1}^{n} P(X_i, Y_j) \log_2 \frac{P(X_i, Y_j)}{P(X_i)\,P(Y_j)}

This looks intimidating, but is really nothing more than sums and products. For instance, P() is computed by binning the feature values and then calculating the fraction of values in each bin; in the plots below, we used ten bins.

In order to restrict mutual information to the interval [0, 1], we divide it by the sum of the individual entropies, which gives us the normalized mutual information:

NI(X;Y) = \frac{I(X;Y)}{H(X) + H(Y)}
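A possible implementation by histogram binning (a sketch that follows the definitions above; the bin count of 10 matches the plots, everything else is our choice):
import numpy as np

def normalized_mutual_info(x, y, bins=10):
    # joint distribution via a 2D histogram
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1)   # marginal distribution of x
    py = pxy.sum(axis=0)   # marginal distribution of y

    def H(p):
        p = p[p > 0]       # 0 * log(0) is taken as 0
        return -np.sum(p * np.log2(p))

    # I(X;Y) = H(X) + H(Y) - H(X,Y)
    mi = H(px) + H(py) - H(pxy.ravel())
    return mi / (H(px) + H(py))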

The nice thing about mutual information is that, unlike correlation, it does not look only at linear relationships, as we can see in the following graphs.
[Figure: the linear feature pairs from before, now annotated with normalized mutual information values such as NI(X1, X2) = 0.290, 0.252, and 0.079.]

We can see that mutual information also indicates linear relationships. And, as the following plots show, it detects nonlinear relationships as well:
[Figure: the quadratic feature pairs from before, annotated with normalized mutual information values such as NI(X1, X2) = 0.287, 0.223, and 0.107; the dependence that correlation missed is now detected.]

So, for feature selection, we calculate the normalized mutual information for all feature pairs. For every pair with too high a value (we would have to determine what that means for our data), we drop one of the two. In the case of regression, we can also drop features that have too little mutual information with the desired result value.

This might work for feature sets that are not too big. At some point, however, this procedure becomes really expensive, because the amount of computation grows quadratically as we compute the mutual information between all feature pairs (see the discussion above).

Another big disadvantage of filters is that they drop features that are not useful in isolation. More often than not, there are a handful of features that seem totally independent of the target variable, yet are powerful when combined. To keep these, we need wrappers.

Asking the model about the features using wrappers

While filters can help tremendously in getting rid of useless features, they can only go so far. After all the filtering, there may still be features that are independent among themselves and show some dependence with the result variable, but that are totally useless from the model's point of view. Just think of data that describes the XOR function. Individually, neither A nor B shows any sign of dependence on Y, whereas together they clearly determine it:

A  B  Y
0  0  0
0  1  1
1  0  1
1  1  0

So, why not ask the model itself to vote on the individual features? This is what wrappers do, as depicted in the following workflow:

[Figure: the wrapper workflow; the feature set x1, ..., xN and the target y are fed to the model repeatedly, feature importance is assessed during training, and the best-performing feature subset is returned.]

Here, we pushed the calculation of feature importance into the model training process. Unfortunately (but understandably), feature importance is not returned as a binary decision, but as a ranking, so we still have to specify where to make the cut: what part of the features are we willing to take, and what part do we want to drop?

Coming back to scikit-learn, we find various excellent wrapper classes in the sklearn.feature_selection package. A real workhorse in this field is RFE, which stands for recursive feature elimination. It takes an estimator and the desired number of features to keep as parameters, and then trains the estimator with various feature subsets until it has found a subset that is small enough. The RFE instance itself pretends to be an estimator, thereby, indeed, wrapping the provided estimator.

In the following example, we create an artificial classification problem of 100 samples using the convenient make_classification() function. It lets us specify the creation of 10 features, out of which only three carry real information for solving the classification problem:
>>> from sklearn.feature_selection import RFE
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=100, n_features=10,
...                            n_informative=3, random_state=0)
>>> clf = LogisticRegression()
>>> clf.fit(X, y)
>>> selector = RFE(clf, n_features_to_select=3)
>>> selector = selector.fit(X, y)
>>> print(selector.support_)
[False True False True False False False False True False]
>>> print(selector.ranking_)
[4 1 3 1 8 5 7 6 1 2]

The problem in real-world scenarios is, of course, how do we know the right value for n_features_to_select? The truth is, we don't. But most of the time we can use a sample of the data and play with different settings to quickly get a feel for the right ballpark. The good thing is that we don't have to be very exact when using wrappers. Let's try different values of n_features_to_select and see how support_ and ranking_ change:


n_features_to_select  support_                                                       ranking_
 1  [False False False  True False False False False False False]  [ 6  3  5  1 10  7  9  8  2  4]
 2  [False False False  True False False False False  True False]  [ 5  2  4  1  9  6  8  7  1  3]
 3  [False  True False  True False False False False  True False]  [ 4  1  3  1  8  5  7  6  1  2]
 4  [False  True False  True False False False False  True  True]  [ 3  1  2  1  7  4  6  5  1  1]
 5  [False  True  True  True False False False False  True  True]  [ 2  1  1  1  6  3  5  4  1  1]
 6  [ True  True  True  True False False False False  True  True]  [ 1  1  1  1  5  2  4  3  1  1]
 7  [ True  True  True  True False  True False False  True  True]  [ 1  1  1  1  4  1  3  2  1  1]
 8  [ True  True  True  True False  True False  True  True  True]  [ 1  1  1  1  3  1  2  1  1  1]
 9  [ True  True  True  True False  True  True  True  True  True]  [ 1  1  1  1  2  1  1  1  1  1]
10  [ True  True  True  True  True  True  True  True  True  True]  [ 1  1  1  1  1  1  1  1  1  1]

We can see that the result is very stable: features that were used when requesting smaller feature sets keep being selected as more features are let in. Finally, we rely on our train/test split to warn us when we go in the wrong direction.


Other feature selection methods

Several other feature selection methods exist, which you will discover while reading the machine learning literature. Some do not even look like feature selection methods, because they are embedded in the learning process (not to be confused with the wrappers mentioned above). Decision trees, for instance, have a feature selection mechanism implanted deep in their core. Other learning methods employ some kind of regularization that punishes model complexity, driving the learning process toward models that perform well and are still simple; they decrease the importance of the less impactful features down to zero and then drop them (L1 regularization).

So watch out! Often, the power of a machine learning method is to a great degree due to its built-in feature selection.


Feature extraction

At some point, after we have removed redundant features and dropped irrelevant ones, we often still find that we have too many features. No matter what learning method we use, they all perform badly, and given the huge feature space we understand that they actually cannot do better. We realize that we have to cut into living flesh and get rid of features that all common sense tells us are valuable. Another situation in which we need to reduce the dimensionality, and in which feature selection does not help much, is when we want to visualize the data; then, we need at most three dimensions at the end to produce any meaningful graph.

Enter feature extraction methods. They restructure the feature space to make it more digestible for the model, or simply cut down the dimensionality to two or three so that we can plot the dependencies visually.

Again, we can distinguish between linear and nonlinear feature extraction methods, and, as in the feature selection section, we present one method of each type: principal component analysis as the linear method and the nonlinear multidimensional scaling. Although they are widely known and used, they are only representatives of many more interesting and powerful feature extraction techniques.


About principal component analysis (PCA)

Principal component analysis (PCA) is often the first thing to try when you want to cut down the number of features and do not know which feature extraction method to use. PCA is limited because it is a linear method, but chances are that it already goes far enough for your model to learn well. Add to this the strong mathematical properties it offers, the speed at which it finds the transformed feature space, and its ability to transform between the original and the transformed features later, and we can almost guarantee that it will become one of your frequently used machine learning tools.

Sketching PCA

Summarizing it: given the original feature space, PCA finds a linear projection of it into a lower-dimensional space with the following properties:
- the conserved variance is maximized;
- the final reconstruction error (when trying to go back from the transformed features to the original ones) is minimized.

As PCA simply transforms the input data, it can be applied to both classification and regression problems. In this section, we use a classification task to discuss the method.

Roughly, PCA works as follows:
1. Center the data by subtracting the mean from it.
2. Calculate the covariance matrix.
3. Calculate the eigenvectors of the covariance matrix.

If we start with N features, the algorithm returns a transformed feature space that again has N dimensions (so far we have gained nothing). The nice thing, however, is that the eigenvalues indicate how much of the variance is described by the corresponding eigenvector. Assume we start with N = 1000 features and know that our model does not work well with more than 20 of them. Then we simply pick the 20 eigenvectors with the highest eigenvalues.
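These three steps translate almost directly into NumPy; the following is a minimal sketch of the algorithm just described (not how scikit-learn implements it internally):
import numpy as np

def pca_manual(X, n_components):
    # 1. center the data
    Xc = X - X.mean(axis=0)
    # 2. covariance matrix of the features (columns)
    cov = np.cov(Xc, rowvar=False)
    # 3. eigen-decomposition (eigh, since cov is symmetric)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # keep the eigenvectors with the largest eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    return Xc.dot(eigvecs[:, order])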

Applying PCA

Let us consider the following artificial dataset, visualized in the left plot below:
>>> x1 = np.arange(0, 10, .2)
>>> x2 = x1 + np.random.normal(loc=0, scale=1, size=len(x1))
>>> X = np.c_[(x1, x2)]
>>> good = (x1 > 5) | (x2 > 5)  # some arbitrary classes
>>> bad = ~good                 # to make the example look good

Scikit-learn provides the PCA class in its decomposition package. In this example we can clearly see that one dimension should be enough to describe the data, and we can specify this with the n_components parameter:
>>> from sklearn import linear_model, decomposition, datasets
>>> pca = decomposition.PCA(n_components=1)

[Figure: left, the original two-dimensional data with both classes; right, the data after the PCA transformation, projected onto the first principal component.]
Here we can use pca's fit() and transform() methods (or their fit_transform() combination) to analyze the data and project it into the transformed feature space:
>>> Xtrans = pca.fit_transform(X)

As specified, Xtrans contains only one dimension. You can see the result in the right plot above; the outcome is even linearly separable in this case. We would not need a complex classifier to distinguish the two classes.

To get an understanding of the reconstruction error, we can look at the variance of the data that we retained in the transformation:
>>> print(pca.explained_variance_ratio_)
[ 0.96393127]

This means that after going from two dimensions to one, we still retain 96 percent of the variance.
Of course, it is not always this simple. Often, we do not know in advance what number of dimensions is advisable. In that case, we leave the n_components parameter unspecified when initializing PCA, letting it compute the full transformation. After fitting the data, explained_variance_ratio_ contains an array of ratios in decreasing order: the first value is the fraction of variance along the direction of highest variance, the second that of the second-highest-variance direction, and so on. After plotting this array, we quickly get a feel for how many components we need: the number of components immediately before the elbow of the chart is often a good guess.

Plots displaying the explained variance as a function of the number of components are called scree plots. A nice example of combining a scree plot with a grid search to find the best setting for a classification problem can be found at http://scikit-learn.sourceforge.net/stable/auto_examples/plot_digits_pipe.html.
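A scree-plot sketch for our example (assuming matplotlib is imported as plt, as elsewhere in the book):
>>> pca = decomposition.PCA()   # no n_components: full transformation
>>> pca.fit(X)
>>> ratios = pca.explained_variance_ratio_
>>> plt.plot(np.arange(1, len(ratios) + 1), np.cumsum(ratios))
>>> plt.xlabel('number of components')
>>> plt.ylabel('cumulative explained variance')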


Limitations of PCA and how LDA can help

Being a linear method, PCA naturally has limitations when faced with data that has nonlinear relationships. We will not go into details here; suffice it to say that there are extensions of PCA, for example Kernel PCA, that introduce nonlinear transformations so that we can still use the PCA approach.

Another interesting weakness of PCA, which we will cover here, appears in certain classification problems. Let us replace good = (x1 > 5) | (x2 > 5) with good = x1 > x2 to simulate such a special case, and we quickly see the problem.

[Figure: left, the data colored by the new classes good = x1 > x2; right, the one-dimensional PCA projection, in which the two classes overlap completely.]

Here, the classes are distributed not along the direction of highest variance, but along the second-highest one, and PCA falls flat on its face. As we do not give PCA any cues about the class labels, it cannot do any better.

Linear Discriminant Analysis (LDA) comes to the rescue here. It is a method that tries to maximize the distance between points belonging to different classes while minimizing the distances between points of the same class. We will not go into the details of the underlying theory, just give a quick tutorial on how to use it:
>>> from sklearn import lda
>>> lda_inst = lda.LDA(n_components=1)
>>> Xtrans = lda_inst.fit_transform(X, good)

That is all. Note that, in contrast to the previous PCA example, we provide the class labels to the fit_transform() method: PCA is an unsupervised feature extraction method, whereas LDA is a supervised one. The result looks as expected:

[Figure: the same data projected with LDA; the two classes are now well separated in the one-dimensional projection.]

So why consider PCA at all, and not simply always use LDA? Well, it is not that simple. With an increasing number of classes and fewer samples per class, LDA no longer looks so good. Also, PCA seems to be less sensitive to different training sets than LDA. So when we have to advise which method to use, we can only suggest a clear "it depends".


Multidimensional scaling

While PCA tries to maximize the retained variance, multidimensional scaling (MDS) tries to preserve the relative distances as much as possible when reducing the dimensionality. This is useful when we have a high-dimensional dataset and want to get a visual impression of it.

MDS does not care about the data points themselves; instead, it is interested in the dissimilarities between pairs of data points, which it interprets as distances. It takes the N data points of dimension k, computes a distance matrix using a distance function d0 that measures the (most of the time, Euclidean) distance in the original feature space, and then tries to position the points in a lower-dimensional space such that the new distances between them resemble the original distances as much as possible. As MDS is most often used for visualization, that lower dimension is usually two or three.

Let us look at the following simple data, consisting of three points in five-dimensional space. Two of the points are close together and one is very distinct, and we want to visualize this in three and two dimensions:
>>> X = np.c_[np.ones(5), 2 * np.ones(5), 10 * np.ones(5)].T
>>> print(X)
[[  1.   1.   1.   1.   1.]
 [  2.   2.   2.   2.   2.]
 [ 10.  10.  10.  10.  10.]]

Using the MDS class from scikit-learn's manifold package, we first specify that we want to transform X into a three-dimensional Euclidean space:
>>> from sklearn import manifold
>>> mds = manifold.MDS(n_components=3)
>>> Xtrans = mds.fit_transform(X)

To visualize the data in two dimensions, we would set n_components accordingly. The results can be seen in the following two graphs: the triangle and the circle are close together, whereas the star is far away.

[Figure: the three points embedded by MDS in 3D (left) and 2D (right); the two similar points stay close, while the distinct one stays far away.]
Let us now look at the slightly more complex Iris dataset, which we will also use to contrast MDS with PCA. The Iris dataset contains four attributes per flower. With the preceding code, we can project it into three-dimensional space while preserving the relative distances between the individual flowers as much as possible. In the previous example, we did not specify a metric, so MDS defaults to Euclidean. Flowers that differ according to their four attributes should therefore also be far apart in the MDS-scaled three-dimensional space, and flowers that are similar should now be near each other, as in the following diagram:
[Figure: the Iris dataset reduced to 3D and 2D with MDS; the three species form visible groups.]

Doing the dimensionality reduction to three and two dimensions with PCA instead, we see the expected larger spread of the flowers belonging to the same class, as in the following diagram:
[Figure: the Iris dataset reduced to 3D and 2D with PCA, for comparison.]

Of course, using MDS requires an understanding of the units of the individual features; perhaps we are using features that cannot be compared with the Euclidean metric. For instance, a categorical variable, even when encoded as an integer (0 = red circle, 1 = blue star, 2 = green triangle, and so on), cannot be compared in a Euclidean way (is red closer to blue than to green?). Once we are aware of this issue, however, MDS is a useful tool that reveals similarities in our data that would otherwise be difficult to see in the original feature space.

Looking a bit deeper into MDS, we realize that it is not a single algorithm, but a family of different algorithms, of which we have used just one. The same is true of PCA. And in case neither PCA nor MDS solves your problem, just look at the many other manifold learning algorithms available in the scikit-learn toolkit.

However, before you get overwhelmed by the many different algorithms, it is always best to start with the simplest one and see how far it gets you. Then take the next, more complex one, and continue from there.

Summary

In this chapter you learned that sometimes you can drop entire features using feature selection methods. We also saw that in some cases this is not enough, and we have to employ feature extraction methods that reveal the real, lower-dimensional structure in the data, in the hope that the model has an easier game with it.

We have only scratched the surface of the huge body of available dimensionality reduction methods. Still, we hope we have gotten you interested in this field, as there are lots of other methods waiting for you to discover. In the end, feature selection and extraction is an art, just like choosing the right learning method or training model.

The next chapter covers the use of Jug, a little Python framework for managing computations in a way that takes advantage of multiple cores or multiple machines; you will also learn about AWS, the Amazon cloud.

Chapter 12. Bigger data

It is not easy to say what big data is. We adopt an operational definition: when data is so large that it becomes cumbersome to work with, we speak of big data. In some areas, this might mean petabytes of data or trillions of transactions, data that will not fit on a single hard drive; in other cases, it may be a hundred times smaller but still difficult to work with.

Why has data itself become an issue? While computers keep getting faster and gain more memory, the size of the data has grown as well. In fact, data has grown faster than computational speed, and few algorithms scale linearly with the size of the input; taken together, this means that data has grown faster than our ability to process it.

We first build on the experience of the previous chapters and work with what can be called medium data settings (not quite big data, but not small either). For this, we use a package called jug, which lets us perform the following tasks:
- break a pipeline into tasks;
- cache (memoize) intermediate results;
- make use of multiple cores, including multiple computers on a grid.

The next step is to move to true big data, and we will see how to use the cloud for computation, in particular the Amazon Web Services infrastructure. In that section, we introduce another Python package, StarCluster, to manage clusters.


Learning about big data

The expression "big data" does not mean a specific amount of data, neither in the number of examples nor in the number of gigabytes, terabytes, or petabytes occupied by the data. It means that data has been growing faster than processing power. This implies the following:
- Some methods and techniques that worked well in the past now need to be redone or replaced, as they do not scale to the new size of the input data.
- Algorithms cannot assume that all the input data fits in memory.
- Managing data becomes a major task in itself.
- Using computer clusters or multicore machines becomes a necessity, not a luxury.

This chapter focuses on this last piece of the puzzle: how to use multiple cores, on the same machine or on separate machines, to speed up and organize computations. This will also be useful in other, medium-sized data tasks.

Using jug to break up your pipeline into tasks

Often, we have a simple pipeline: we preprocess the initial data, compute features, and then call a machine learning algorithm with the resulting features.

Jug is a package developed by Luis Pedro Coelho, one of the authors of this book. It is open source (using the liberal MIT license) and can be useful in many areas, but was designed specifically around data analysis problems. It solves several problems simultaneously, for example:
- It can memoize results to disk (or to a database), which means that if you ask it to compute something you have computed before, the result is read from disk instead.
- It can use multiple cores, or even multiple computers in a cluster. Jug was also designed to work well in batch computing environments that use queuing systems such as PBS (Portable Batch System), LSF (Load Sharing Facility), or Grid Engine. This will be used in the second half of the chapter, when we build online clusters and dispatch jobs to them.

An introduction to tasks in jug

Tasks are the basic building blocks of jug. A task is a function together with values for its arguments. Consider this simple example:
def double(x):
    return 2*x

In this chapter, the code examples generally have to be typed into script files, so they are not shown with the >>> marker. Commands to be typed at the shell are indicated by a dollar sign ($) prefix.

A task could be "call double with argument 3" or "call double with argument 642.34". Using jug, we build these tasks as follows:
from jug import Task
t1 = Task(double, 3)
t2 = Task(double, 642.34)

Save this to a file called jugfile.py (an ordinary Python file). Now we can run jug execute to run the tasks. This is typed on the command line:
$ jug execute

You will also get some feedback on the tasks (jug will report that two tasks named double were run). Run jug execute again and it will tell you that it did nothing! It does not need to: in this case we gained little, but if the tasks took a long time to compute, the caching would be very useful. You may notice that a new directory named jugfile.jugdata also appeared on your hard drive, with a few oddly named files: this is the memoization cache. If you remove it, jug execute will run all the tasks again.

Often, it is good to distinguish pure functions, which simply take their inputs and return a result, from more general functions that perform actions, such as reading from files, writing to files, accessing global variables, or modifying their arguments (anything the language allows). Some programming languages, such as Haskell, even have syntactic ways of distinguishing pure from impure functions.

With jug, your tasks do not need to be perfectly pure. It is even recommended that you use tasks to read in your data or to write out your results. However, accessing and modifying global variables will not work well: the tasks may be run in any order, on different processors. The exception is global constants, although even those may confuse the memoization system (if the value changes between runs). Similarly, you should not modify the input values. Jug has a debug mode (run jug execute --debug), which slows down the computation but gives useful error messages if you make this sort of mistake.

The preceding code works, but is a bit cumbersome: you keep repeating the Task(function, argument) construct. Using a bit of Python magic, we can make the code more natural:
:
from jug import TaskGenerator
from time import sleep

@TaskGenerator
def double(x):
    sleep(4)
    return 2*x

@TaskGenerator
def add(a, b):
    return a + b

@TaskGenerator
def print_final_result(oname, value):
    with open(oname, 'w') as output:
        output.write('Final result: {}\n'.format(value))

y = double(2)
z = double(y)
y2 = double(7)
z2 = double(y2)
print_final_result('output.txt', add(z, z2))
Except for the use of TaskGenerator, this could be a standard Python file! However, using TaskGenerator actually creates a series of tasks, and it is now possible to run them in a way that takes advantage of multiple processors. Behind the scenes, the decorator transforms your functions so that they do not actually execute when called, but instead create a Task object. We also take advantage of the fact that we can pass tasks to other tasks: this generates a dependency between them.

You may have noticed that we added a few sleep(4) calls; these simulate a long computation. Otherwise, this example would be so fast that there would be no point in using multiple processors.
We start by running jug status, which prints a table like the following:

Task name                    Waiting  Ready  Finished  Running
jugfile.print_final_result   1        0      0         0
jugfile.add                  1        0      0         0
jugfile.double               2        2      0         0
Total                        4        2      0         0
Now, we start two processes simultaneously (putting them in the background with the & operator):
$ jug execute &
$ jug execute &

Then we run jug status again:

Task name                    Waiting  Ready  Finished  Running
jugfile.print_final_result   1        0      0         0
jugfile.add                  1        0      0         0
jugfile.double               2        0      0         2
Total                        4        0      0         2

We can see that the two initial double operations are running at the same time. After about 8 seconds, the whole process will finish and the output.txt file will be written.
By the way, if your file were called anything other than jugfile.py, you would have to specify it explicitly on the command line. For example, if it were called analysis.py, you would run:
$ jug execute analysis.py

This is the only disadvantage of not using the name jugfile.py, so feel free to use more meaningful names.

Looking under the hood

How does jug work? At the basic level, it is very simple. A Task is a function plus its arguments. Its arguments may be either plain values or other tasks. If a task takes another task as an argument, there is a dependency between the two (and the second cannot run until the results of the first are available).

Based on this, jug recursively computes a hash for each task. This hash value encodes the whole computation needed to get the result. When you run jug execute, for each task there is a little loop that runs the logic depicted in the following flowchart.
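In pseudocode, that per-task loop looks roughly like this (a sketch of the logic only, not jug's actual source; the backend and lock names are hypothetical):
for task in topologically_sorted(tasks):
    h = task.hash()            # encodes function, arguments, and dependencies
    if backend.has(h):         # already computed: load from the memoization cache
        task.result = backend.load(h)
    elif backend.lock(h):      # make sure no other process is running this task
        task.result = task.run()
        backend.store(h, task.result)
        backend.unlock(h)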

He-r


I
f


( jugfile. jugdata/). ,
Redis.
, jug,
;
, ,
.
( ), -

m1

, 01-111
(, Redis).
1 ,
l\111 .

. ,
. , -
(
), , .
, ,
.

Using jug for data analysis

Jug is a generic framework, but it is ideally suited for medium-scale data analysis. As you develop an analysis pipeline, it is good to have intermediate results saved automatically. If you have already computed the preprocessing step and only change the features you compute, you do not want to recompute the preprocessing. If you have already computed the features but want to add a few new ones to the mix, you do not want to recompute all the others either.

Jug is also specifically optimized to work with NumPy arrays: whenever your tasks return or receive NumPy arrays, you take advantage of this optimization. Jug is another piece of this ecosystem where everything works together.

We now look back at Chapter 10, Computer Vision. There we learned how to compute features on images; the basic pipeline consisted of the following steps:
- loading image files;
- computing features;
- combining these features;
- normalizing the features;
- creating a classifier.

We will redo that exercise, but this time using jug. The advantage of this version is that we can now add a new feature or classifier without having to recompute the whole pipeline. We start with a few imports:
from jug import TaskGenerator
import mahotas as mh
from glob import glob

Now we define the first task generators, the feature computation functions:
@TaskGenerator
def compute_texture(im):
    from features import texture
    imc = mh.imread(im)
    return texture(mh.colors.rgb2gray(imc))

@TaskGenerator
def chist_file(fname):
    from features import chist
    im = mh.imread(fname)
    return chist(im)

The features module we import here is the one from Chapter 10, Computer Vision.

Note that these functions take the filename as input rather than the image array. Using the full images would also work, of course, but this is a small optimization: a filename is a string, which is small when written to the backend and very fast to hash. It also ensures that the images are only loaded by the processes that need them.

We can use TaskGenerator on any function, even on functions we did not write, such as np.array and np.hstack:
import numpy as np
to_array = TaskGenerator(np.array)
hstack = TaskGenerator(np.hstack)

haralicks = []
chists = []
labels = []

# Change this variable to point to
# the location of the dataset on disk
basedir = '../SimpleImageDataset/'
# Use glob to get all the images
images = glob('{}/*.jpg'.format(basedir))

for fname in sorted(images):
    haralicks.append(compute_texture(fname))
    chists.append(chist_file(fname))
    # The class is encoded in the filename as xxxx00.jpg
    labels.append(fname[:-len('00.jpg')])

haralicks = to_array(haralicks)
chists = to_array(chists)
labels = to_array(labels)

One small inconvenience of using jug is that we must always write functions to output the results to files, as shown above; it is a small price to pay for the extra convenience jug gives us.

Not all tasks compute features; we also wrap the evaluation in a task:
@TaskGenerator
def accuracy(features, labels):
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn import cross_validation

    clf = Pipeline([('preproc', StandardScaler()),
                    ('classifier', LogisticRegression())])
    cv = cross_validation.LeaveOneOut(len(features))
    scores = cross_validation.cross_val_score(
        clf, features, labels, cv=cv)
    return scores.mean()

Note that we only import sklearn inside this function. This is a small optimization: that way, sklearn is only imported when it is really needed.
scores_base = accuracy(haralicks, labels)
scores_chist = accuracy(chists, labels)
combined = hstack([chists, haralicks])
scores_combined = accuracy(combined, labels)


Finally, we write and call a function that prints out all the results. It expects its argument to be a list of pairs with the name of the feature set and the results:
@TaskGenerator
def print_results(scores):
    with open('results.image.txt', 'w') as output:
        for k, v in scores:
            output.write('Accuracy [{}]: {:.1%}\n'.format(
                k, v.mean()))

print_results([
    ('base', scores_base),
    ('chists', scores_chist),
    ('combined', scores_combined),
])

That is it. Now, on the shell, run the following command to execute this pipeline with jug:
$ jug execute image-classification.py

Reusing partial results

Suppose, for example, that you want to add a new feature, or even a whole set of features. As we saw in Chapter 10, Computer Vision, this is easy to do by changing the feature computation code. However, that would imply recomputing all the features again, which is wasteful, particularly if you want to test new features and techniques quickly.

We now add a new set of features, another type of texture feature called linear binary patterns. This is implemented in mahotas; we just need to call the function, wrapped in a TaskGenerator:
@TaskGenerator
def compute_lbp(fname):
    from mahotas.features import lbp
    imc = mh.imread(fname)
    im = mh.colors.rgb2grey(imc)
    # The parameters 'radius' and 'points' are set to typical values
    # (check the documentation for how to adjust them)
    return lbp(im, radius=8, points=6)


We replace the previous loop with one that makes the extra function call:
lbps = []
for fname in sorted(images):
    # the rest of the loop stays as before
    lbps.append(compute_lbp(fname))
lbps = to_array(lbps)

We call accuracy with these newer features, combine them all, and print everything:
scores_lbps = accuracy(lbps, labels)
combined_all = hstack([chists, haralicks, lbps])
scores_combined_all = accuracy(combined_all, labels)

print_results([
    ('base', scores_base),
    ('chists', scores_chist),
    ('lbps', scores_lbps),
    ('combined', scores_combined),
    ('combined_all', scores_combined_all),
])

Now, when you run jug execute again, the new features are computed, while the old ones are loaded from the cache. This is where jug is very powerful: it ensures that you always get the results you want, while saving you from unnecessarily recomputing cached values. You will also see that adding this feature set improves on the previous methods.

Not all of jug's features could be mentioned in this chapter, but here is a summary of the most interesting ones we did not cover in the main text:
- jug invalidate: declares that all results of a given function should be considered invalid and recomputed. This also recomputes any downstream computation that depended (even indirectly) on the invalidated results.
- jug status --cache: if you run jug status very often, the --cache flag caches the status and speeds it up. Note that this does not detect changes to the jugfile, but you can always run --cache --clear to remove the cache and start again.
- jug cleanup: removes any extra files from the memoization cache. This is a garbage collection operation.
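For instance, after changing the implementation of compute_lbp from the example above, a typical session might look like this (a sketch of common usage; consult the jug documentation for the exact invocation details):
$ jug invalidate compute_lbp   # mark all compute_lbp results as stale
$ jug execute                  # recompute them and everything downstream
$ jug cleanup                  # drop obsolete entries from the cache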

There are other, more advanced features, which for example allow you to look at values that have been computed inside the jugfile. Read about features such as barriers in the jug documentation online at http://jug.rtfd.org.

Using Amazon Web Services

When you have a lot of data and a lot of computation to perform, you might start to crave more computing power. Amazon (http://aws.amazon.com) lets you rent computing power by the hour. You can thus access a large amount of computing power without having to pre-commit to buying a large number of machines (and managing the infrastructure for them). There are other competitors in this market, but Amazon is the largest player, so we briefly cover it here.

Amazon Web Services (AWS) is a large set of services. We focus only on the Elastic Compute Cloud (EC2) service. This service offers virtual machines and disk space that can be allocated and deallocated quickly.

There are three modes of use: a reserved mode, whereby you prepay to get cheaper per-hour access; a fixed per-hour rate; and a variable rate that depends on the overall compute market (when there is less demand, costs are lower; when demand rises, prices go up).

On top of this general system, there are several types of machines available at varying costs, from a single core to multicore systems with a lot of RAM or even graphical processing units (GPUs). We will see later that you can also rent several of the cheaper machines and build a virtual cluster out of them. You can choose a Linux or Windows server (Linux being slightly cheaper). In this chapter, we work through our examples on Linux, but most of the information is valid for Windows machines as well.

For testing, you can use a single machine in the free tier. This lets you play with the system, get used to the interface, and so on. Note, though, that this machine has a slow CPU.

The resources can be managed through a web interface. However, it is also possible to do so programmatically, and to write scripts that allocate virtual machines, format hard disks, and perform all the operations that are possible through the web interface. In fact, while the web interface changes very frequently (some of the screenshots we show in the book may be out of date by the time it goes to press), the programmatic interface is more stable, and the general architecture has remained stable since the service was introduced.
Access to AWS services is performed through a traditional username/password system, although Amazon calls the username an access key and the password a secret key, probably to keep them separate from the username/password you use to access the web interface. In fact, you can create as many access/secret key pairs as you wish and give them different permissions. This is helpful for a larger team, where a senior user with access to the full web panel can create keys with fewer privileges for other developers.

Amazon.com has several regions. They correspond to physical regions of the world: the US west coast, the US east coast, several Asian locations, a South American one, and two European ones. If you will be transferring data, it is best to keep it close to where you will be transferring to and from. Additionally, if you handle user information, there may be regulatory issues when transferring it to another jurisdiction; in that case, do check with informed counsel on the implications of transferring data about European customers to the US, or any other similar transfer.

Amazon Web Services is a very deep topic, and there are entire books dedicated to it. The purpose of this chapter is to give you an overall impression of what AWS provides and what is possible with it. In the practical spirit of this book, we do this by working through examples, but we will not exhaust all the possibilities.

Creating your first virtual machines

The first step is to go to http://aws.amazon.com/ and create an account. The process is similar to any other online service. A single machine is free, but to get more you will need a credit card. In this example, we use a few machines, so it may cost you a few dollars if you want to follow along. If you are not ready to take out a credit card just yet, you can read the chapter to learn what AWS provides without going through the examples, and then make a more informed decision on whether to sign up.
Once you sign up for AWS and log in, you will be taken to the console. Here, you will see the many services that AWS provides:

[Screenshot: the AWS console with the list of available services.]

We pick and click on EC2 (the top element in the leftmost column; this is the panel as it appeared when this book was written, but Amazon regularly makes minor changes, so you may see something slightly different). We now see the EC2 management console:

[Screenshot: the EC2 management console.]

In the top-right corner you can pick your region (see the note on Amazon regions above). Note that you will only see information about the region that is currently selected. If you mistakenly select the wrong region (or have machines running in multiple regions), your machines may not appear; this seems to be a common pitfall of the EC2 web management console.

In EC2 parlance, a running server is called an instance. We click Launch Instance, which leads to the following screen, asking us to select the operating system to use:

[Screenshot: Step 1, Choose an Amazon Machine Image (AMI).]
Select the Amazon Linux option (if you are familiar with one of the other offered Linux distributions, such as Red Hat, SUSE, or Ubuntu, you can select one of them instead, but the configurations will be slightly different). Now that you have selected the software, you need to select the hardware. The next screen asks you to choose the machine type:

[Screenshot: Step 2, Choose an Instance Type.]

We start with one instance of the t2.micro type (the older t1.micro was an even less powerful machine). This is the smallest possible machine, and it is free. Keep clicking Next and accept all the defaults until you come to the screen mentioning a key pair:

[Screenshot: the key pair selection screen.]

We pick the name awskeys for the key pair, check Create a new key pair, and name the key pair file awskeys.pem. Download and save this file somewhere safe! This is the SSH (Secure Shell) key that will let you log in to your cloud machine. Accept the remaining defaults, and your instance will launch.

You now need to wait a few minutes for the instance to come up. Eventually, it will be shown in green with the status running:
[Screenshot: the instance list showing the running instance and its public IP address.]

In the screenshot you can find the public IP address, which can be used to log in to the instance:
$ ssh -i awskeys.pem ec2-user@54.93.165.5

We call the ssh command and pass it the key file we downloaded earlier as the identity (using the -i option). We log in as the user ec2-user on the machine with the IP address 54.93.165.5. The address will, of course, be different in your case. If you choose another distribution for your instance, the username may change as well; in any case, the web panel shows the correct information. You log in as root, ubuntu (for the Ubuntu distribution), or fedora (for the Fedora distribution), respectively.

Finally, if you are running a Unix-style operating system (including Mac OS), you may have to tweak the key file's permissions:
$ chmod 600 awskeys.pem

This sets read/write permissions for the current user only; SSH otherwise shows an ugly warning.
Now you should be able to log in to your machine. If everything is okay, you will see the banner, as in the following screenshot:

[Screenshot: the Amazon Linux welcome banner after SSH login, including the hint "Run 'sudo yum update' to apply all updates".]

This is a regular Linux box on which you have sudo permission: you can run any command as the superuser by prefixing it with sudo. You can run the update command it recommends to bring the machine up to speed.

Installing Python packages on Amazon Linux

If you prefer another distribution, you can use your knowledge of it to install Python, NumPy, and the rest. Here, we do it on the standard Amazon distribution. We start by installing several basic Python packages:
$ sudo yum -y install python-devel \
    python-pip numpy scipy python-matplotlib

To compile mahotas, we also need a C++ compiler:
$ sudo yum -y install gcc-c++

Finally, we install git to be able to get the latest version of the book's code:
$ sudo yum -y install git

On this system, pip is installed as pip-python, which is not what we are used to. For convenience, we use it to upgrade pip itself, and then use pip to install the necessary packages:
$ sudo pip-python install -U pip
$ sudo pip install scikit-learn jug mahotas

Now you can install anything else you wish using pip.

Running jug on our cloud machine

We can now download the data and code for the book with the following sequence of commands:
$ git clone \
https://github.com/luispedro/BuildingMachineLearningSystemsWithPython
$ cd BuildingMachineLearningSystemsWithPython
$ cd ch12

Finally, we run:
$ jug execute

This would work just fine, but we would have to wait a long time for the results. Our free tier machine (of type t2.micro) is not very fast and has only a single processor. So, we will upgrade it!

We go back to the EC2 console and right-click the running instance to get the pop-up menu. We first need to stop the instance. This is the virtual machine equivalent of powering off. You can stop your machines at any time; from that point on, you stop paying for them (you are still using disk space, though, which is billed separately). You can also terminate the instance, which destroys the disk as well; you then lose all information saved on the machine.

Once the machine is stopped, the Change instance type option becomes available. Now we can select a more powerful instance, for example, a c1.xlarge instance with eight cores. The machine is still off, so you need to start it again (the virtual equivalent of booting up).

AWS offers several instance types at different price points. As this information is constantly revised, with more powerful options being introduced and prices changing (generally getting cheaper), we cannot give many details in the book; you can find up-to-date information on Amazon's website.

We need to wait for the instance to come up again. Once it has, look up its IP address the same way as before. When you change instance types, your instance gets a new address assigned to it.

You can assign a fixed IP to an instance using Amazon's Elastic IPs functionality, which you will find on the left-hand side of the EC2 console. This is useful if you create and modify instances often; there is a small cost associated with this feature.

With eight cores, you can run eight jug processes simultaneously:
$ # the loop below runs 8 times
$ for counter in $(seq 8); do
>     jug execute &
> done

Use jug status to check whether these eight jobs are in fact running. After your jobs finish (which should now happen pretty fast), you can stop the machine and downgrade it to a t2.micro instance again to save money. The micro instance can be used for free (within certain limits), while the c1.xlarge one costs 0.064 US dollars per hour (as of February 2015; check the AWS website for up-to-date information).


Automating the generation of clusters with StarCluster

As we just learned, we can spawn machines using the web interface, but it quickly becomes tedious and error-prone. Fortunately, Amazon has an API. This means that we can write scripts that perform all the operations discussed earlier automatically. Even better, others have already developed tools that mechanize and automate many of the processes you may want to perform with AWS.

A group at MIT developed exactly such a tool, called StarCluster. It happens to be a Python package, so you can install it with the usual Python tools:
$ sudo pip install starcluster

You can run this from an Amazon machine or from your local machine; either option works.

We now need to specify what our cluster will look like. We do so by editing a configuration file. We generate a template configuration file by running:
$ starcluster help
Then pick the option of generating the configuration file in ~/.starcluster/config. Once this is done, we edit it manually.

Keys, keys, and more keys: there are three completely different types of keys that matter when dealing with AWS. First, there is the standard username/password combination you use to log in to the website. Second, there is the SSH key system, a public/private key system implemented with files: with your public key file, you can log in to remote machines. Third, there is the AWS access key/secret key system, which is just a form of username/password that allows multiple users on the same account (including different permissions for each, though we will not cover those advanced features here).

To look up our access/secret keys, we go back to the AWS console, click our name in the top-right corner, and select Security Credentials. Near the bottom of the screen we find our access key, which looks something like AAKIIT7HHF6IUSN3OCAA (the value shown here is a made-up example).

Now we edit the configuration file. It is a standard .ini file: a text file in which sections start with names in brackets and options are specified in name = value format. The first section is the aws info section; copy and paste your keys there:
[aws info]
AWS_ACCESS_KEY_ID = AAKIIT7HHF6IUSN3OCAA
AWS_SECRET_ACCESS_KEY = <your secret key>
Now we come to the essence of the cluster definition. StarCluster lets you define as many different clusters as you wish. The starting file contains one, called smallcluster, defined in the [cluster smallcluster] section. We edit it to read as follows:
[cluster smallcluster]
KEYNAME = mykey
CLUSTER_SIZE = 16

This changes the number of nodes to 16 instead of the default of two. We could additionally specify which type of instance each node should be, and the initial image (remember, an image is used to initialize the virtual hard disk; it defines the operating system and the installed software). StarCluster has a few predefined images, but you can also build your own.

We need to create a new SSH key to go with the cluster:
$ starcluster createkey mykey -o ~/.ssh/mykey.rsa

Now that we have configured a 16-node cluster and set up the keys, let us try it out:
$ starcluster start smallcluster

This may take a few minutes, as it allocates 17 new machines. Why 17, when our cluster has only 16 nodes? StarCluster always creates a master node. All these nodes share the same filesystem, so anything we create on the master node is also seen by the worker nodes. This also means that we can use jug on these clusters.

These clusters can be used as you wish, but they come pre-equipped with a job queue engine, which makes them ideal for batch processing. The process of using them is simple:
1. Log in to the master node.
2. Prepare your scripts on the master (or, better yet, have them prepared beforehand).
3. Submit jobs to the queue. A job can be any Unix command; the scheduler will find free nodes and run your job.
4. Wait for the jobs to finish.
5. Read the results on the master node. At this point, you can also kill all the slave nodes to save money.

In any case, do not leave your system running when you no longer need it! Otherwise, it will cost you (in the dollars-and-cents meaning).

Before logging in to the cluster, we copy our data to it (remember, we earlier cloned the repository into BuildingMachineLearningSystemsWithPython):
$ dir=BuildingMachineLearningSystemsWithPython
$ starcluster put smallcluster $dir $dir
We used the $dir variable to make the command fit on a single line. We can log in to the master node with a single command:
$ starcluster sshmaster smallcluster

We could also have looked up the address of the generated machine and used an ssh command as before, but with the preceding command it does not matter what the address is: StarCluster takes care of it behind the scenes.

As we said earlier, StarCluster provides a batch queuing system for its clusters: you write a script to perform your actions, put it on the queue, and it runs on any available node.

At this point, we need to install some packages on the cluster again. Fortunately, StarCluster has already done half the work. If this were a real project, we would set up a script to perform all the initialization for us; StarCluster can do that. Since this is a tutorial, we simply run the installation step again:
$ pip install jug mahotas scikit-learn
:
# 1 /usr/bin/env bash
jug execute jugfile.py
run-jugfile.sh II chmod + run
jugfile. sh .
16 ::
$ for in $(seq 16); do
>
qsub -cwd run-jugfile.sh
> done
16 ,
run-jugfile. sh, jg.
, . ,
jug status ,
11. ,jg
, ,
.
.
, 1 .
-/results :
# mkdir -/results
# results.image.txt -/results
:
# exit

We are now back at our AWS machine (notice the $ sign in the next code examples). First, we copy the results back to this computer using the starcluster get command (the mirror image of put, which we used before):
$ starcluster get smallcluster results results

Finally, we kill all the nodes to save money:
$ starcluster stop smallcluster
$ starcluster terminate smallcluster

Note that terminating really destroys the filesystem and all your results. In our case, we copied the final results to safety manually. Another possibility is to have the cluster write to a filesystem that is not allocated and destroyed by StarCluster, but is available to you on a regular instance; in fact, the flexibility of these tools is immense. However, such advanced manipulations cannot all fit into this book.

StarCluster has excellent documentation online at http://star.mit.edu/cluster/, which you should read for more information about all the possibilities of this tool; we have seen only a small fraction of its functionality and used only the default settings here.

Summary

We saw how to use jug, a little Python framework for managing computations in a way that takes advantage of multiple cores or multiple machines. Although this framework is generic, it was built specifically to address the data analysis needs of its author (who is also an author of this book). Therefore, it has several aspects that make it fit well with the rest of the Python machine learning environment.

You also learned about AWS, the Amazon cloud. Using cloud computing is often a more effective use of resources than building your own computing capacity, particularly if your needs are not constant and keep changing. StarCluster even allows clusters that automatically grow as you launch more jobs and shrink as they terminate.

This is the end of the book. We have come a long way: you learned how to perform classification when data is labeled and clustering when it is not; you learned about dimensionality reduction and topic modeling to make sense of large datasets; toward the end, we looked at some specific applications, such as music genre classification and computer vision. For the implementations, we relied on Python. This language has an ever-expanding ecosystem of numeric computing packages built on top of NumPy. Whenever possible, we relied on scikit-learn, but used other packages when necessary. Because they all use the same basic data structure, the NumPy multidimensional array, it is possible to mix functionality from different packages and easily chain together the different components of an algorithm.

All of the packages used in this book are open source and available for use in any project. Machine learning is a fast-moving field, though, and we hope that you will keep learning as new methods, tools, and data become available.

Appendix. Where to learn more about machine learning

We are at the end of the book. Let us now take a moment to look at what else might be useful for our readers. There are many wonderful resources for learning more about machine learning, way too many to cover here; the following list is therefore only a small and necessarily biased sample of what the authors consider best at the time of writing.

Online courses

Andrew Ng is a professor at Stanford who runs an online machine learning course as a massive open online course at Coursera (http://www.coursera.org). It is free of charge, but may represent a significant time investment.

Books

This book focused on the practical side of machine learning: we did not present the thinking behind the algorithms or the theory that justifies them. If you are interested in that aspect of machine learning, we recommend Pattern Recognition and Machine Learning by Christopher Bishop. It is a classical introductory text in the field and will teach you the nitty-gritty of most of the algorithms we used in this book.

If you want to move beyond the introduction and learn all the gory mathematical details, Machine Learning: A Probabilistic Perspective by Kevin P. Murphy is an excellent option (www.cs.ubc.ca/~murphyk/MLbook). It is very recent (published in 2012) and contains the cutting edge of machine learning research. At over a thousand pages, it can also serve as a reference, as very little of machine learning has been left out.

Q&A sites

MetaOptimize (http://metaoptimize.com/qa) is a machine learning Q&A website where many very knowledgeable researchers and practitioners interact.

Cross Validated (http://stats.stackexchange.com) is a general statistics Q&A site that often features machine learning questions as well.

As mentioned at the beginning of the book, if you have questions about specific parts of the book, feel free to ask them at TwoToReal (http://www.twotoreal.com); we try to jump in quickly and help as best we can.

Blogs

The following is a clearly non-exhaustive list of blogs that are interesting to someone working on machine learning:
- Machine Learning Theory: http://hunch.net. About one post per month, leaning toward the theoretical side of the field.
- Text and data mining by practical means: http://textanddatamining.blogspot.de. About one post per month, very practical, with always surprising approaches.
- Edwin Chen's blog: http://blog.echen.me. About one post per month, covering more applied topics.
- Machined Learnings: http://www.machinedlearnings.com. About one post per month, covering more applied topics.
- FlowingData: http://flowingdata.com. About one post per day, with the posts revolving more around statistics.
- Simply Statistics: http://simplystatistics.org. Several posts per month, focused on statistics and big data.
- Statistical Modeling, Causal Inference, and Social Science (Andrew Gelman): http://andrewgelman.com. About one post per day, often curious, with the author pointing out flaws in the use of statistics in the popular media.


Data sources

If you want to play around with algorithms, you can obtain many datasets from the Machine Learning Repository of the University of California at Irvine (UCI): http://archive.ics.uci.edu/ml.

Getting competitive

An excellent way to learn more about machine learning is by trying out a competition! Kaggle (http://www.kaggle.com) is a marketplace for machine learning competitions. On the website, you will find several competitions with different structures and, often, cash prizes. The supervised learning competitions almost always follow the same format: you (and every other competitor) are given access to labeled training data and unlabeled testing data; your task is to submit predictions for the testing data, and when the competition closes, whoever has the best accuracy wins. The prizes range from glory to cash.

Of course, winning something is nice, but you can gain a lot of useful experience just by participating. So stay tuned after the competition is over, as participants start sharing their approaches in the forums. Most of the time, winning is not about developing a new algorithm, but about cleverly preprocessing, normalizing, and combining existing methods.


What was left out

We did not cover every machine learning package available for Python. Given the limited space, we chose to focus on scikit-learn. However, there are other options, and we list a few of them here:
- MDP toolkit (http://mdp-toolkit.sourceforge.net): a modular toolkit for data processing.
- PyBrain (http://pybrain.org): a Python-based reinforcement learning, artificial intelligence, and neural network library.
- Machine Learning Toolkit (Milk) (http://luispedro.org/software/milk): this package was developed by one of the authors of this book and covers some algorithms and techniques that are not included in scikit-learn.
- Pattern (http://www.clips.ua.ac.be/pattern): a package that combines web mining, natural language processing, and machine learning, with wrapper APIs for Google, Twitter, and Wikipedia.

A more general resource is http://mloss.org, a repository of open source machine learning software. As is usual with such repositories, the quality varies between excellent, well-maintained software and one-off projects that were later abandoned. It may be worth checking whether your problem is very specific and none of the more general packages addresses it.

Summary

We are now truly at the end. We hope you enjoyed the book and feel well equipped to start your own machine learning adventure. We also hope you have learned the importance of carefully testing your methods, in particular of using correct cross-validation and of not reporting training results, which are an over-inflated estimate of how good your method really is.