
....................................................................................

10
.......................................................................................13

I. .........................................20
1. .......................................................21
................................................................................21
BeautifulSoup..................................................................................24
BeautifulSoup...............................................................................24
BeautifulSoup.....................................................................................26
........................................................28

2. HTML..........................................31
..........................................................................31
BeautifulSoup............................................................32
find() findAll()...............................................................................................34
BeautifulSoup...................................................................36
...................................37
-............38
...................................................39
.......................................................40
.....................................................................................41
BeautifulSoup.....................................................46
.....................................................................47
-.............................................................................................48
BeautifulSoup.................................................................................48

3. ................................................................50
..................................................................................50
..........................................................................................54
............................................................................57
...........................................................................................59
Scrapy............................................................................65

4. API..........................................................70
API.................................................................................................71
.............................................................................72
...............................................................................................................72

..............................................................................................73
.....................................................................................................................74
API.......................................................................................................75
Echo Nest.................................................................................................................76
......................................................................................76
Twitter.......................................................................................................................78
......................................................................................78
......................................................................................79
Google API..............................................................................................................83
......................................................................................83
......................................................................................84
JSON-.......................................................................................86
................................................................................88
API.........................................................................92

5. .............................................................94
..........................................................................................................94
CSV..............................................................97
MySQL.....................................................................................................................99
MySQL........................................................................................ 100
................................................................ 102
Python.................................................................................. 106
......... 109
MySQL.......................................................................... 112
........................................................................................... 115

6. ........................................................ 117
...................................................................................... 117
....................................................................................................................... 118
......................................... 119
CSV........................................................................................................................ 124
CSV-.................................................................................... 124
PDF........................................................................................................................ 126
Microsoft Word .docx..................................................................................... 128

II. .....................................132
7. ..............................................................133
................................................... 133
............................................................................... 136

........................................................................ 138
OpenRefine...................................................................................................... 139

8. ......................144
................................................................................... 145
.......................................................................................... 148
: ............................. 152
Natural Language Toolkit................................................................................. 156
............................................................................... 156
NLTK......................................... 156
NLTK............................... 160
.............................................................................. 163

9. , -...........165
requests......................................................................................... 165
.............................................................................. 166
, ................... 168
................................................................ 170
cookies.......................................................................... 171
HTTP-............................................................ 173
................................................. 174

10. JavaScript-............................................175
JavaScript...................................................................... 176
JavaScript.......................................... 177
Ajax HTML......................................................................... 180
JavaScript Python
Selenium........................................................................................................... 181
..................................................................................... 186

11.
...............................................................................................189
............................................................................................... 190
Pillow................................................................................................................ 190
Tesseract........................................................................................................... 191
NumPy.............................................................................................................. 192
................................... 193
,
-.................................................................................................. 196

CAPTCHA Tesseract.................................................. 198


sseract...................................................................................... 200
CAPTCHA .
.................................................................................................... 204

12. ..............................208
................................................. 209
..................................................................... 210
................................................................................... 210
cookies........................................................................................ 212
......................................................................................... 214
, -........... 215
............................................................. 215
................................................................. 217
..................................................... 219

13.
........................................................................................221
................................................................................ 222
?.................................................................... 222
unittest....................................................................... 223
.......................................................................... 224
Selenium............................................................. 227
.......................................................................... 227
Unittest Selenium?.................................................................................... 231

14. ............233
?................................................ 233
IP-..................................................... 234
........................................................... 235
Tor........................................................................................................................... 236
PySocks............................................................................................................ 237
........................................................................................... 238
-.............................................................. 238
.......................................................................................... 240
.............................................................................. 241
.......................................................................................... 242

. , Python..................244
Hello, World!.......................................................................... 244

. , ..............248

.
-.................................................................................252
, , , !............................. 252
........................................................................................... 254
................................................. 256
............ 258
robots.txt .............................................. 259

-..................................................................................................... 263
eBay Bidders Edge
....................................................................................................... 263

....................................................... 265
Google:
robots.txt...................................................................................................... 268

........................................................................................269
..........................................................................................270
.................................................................271


. ,
,
. , ,
. Google
, .
,
, .
, -
. -
,
, , -
.
- -
, . -
. -
. , ,

. .
- -
, , -
, , --
,
. , , ,
3 Facebook, 300
Twitter, 30 ,
, . ., . .

,
. ,

( ), (
). .
. -
. , -
, .
, , .
,

, , -
, -
, . , -
,
.
, -
, , -
, , ,
. , - -
,
. , -, -
,
, , -
, ,
.
,
,
, - -
. -
.
, -
, -
-. Google.
, -
, ,
, Google ,
, , -
, ,
-
.
-


Python.
Python .
.
, -
.
, -
,

, . Python
. 25
Java
C++, ,
-
. Python -
, -
,
, .
-
Python, .
, -
.
, -
. ,
, JavaScript ( -
) . .
, .
.
, ,
,
, ,
, , -
.
,
CleverDATA

-
. , - (web
scraping) . -
, ,
.
, -, -
, -
, , -,
.
-, ,
, , -
, -
, , .
, -
, .
- -

JavaScript, cookies. -
API c -.

, -
-,
-.
1, -
, -
. ,
(
). ,
- .

-?
-
, . - (web
scraping) ,

(screen scraping), (data
mining), - (web harvesting). ,

- (web
scraping), ,
- -
(bots).
-c
, , API ( , -
-). - -
, -,
(HTML ,
-), , -
.
-
, -
. -
( I)1
(II).

-?

, .
JavaScript, -
( ),
- -
( ). -
,
.
, - ,
.
Google cheapest flights to Boston
. Google
, - ,
, -

, -
1

(web-scraper) - (web-crawler). - -
,
. - ,
.
- (web-
spider). . .

. -
-
-
.
, , : ,
API? ( API, -
4.) , API , -
, .
.
API -
, , . -
API ( ),
- .
, API -
:
, API;
, , , API
;

, API.
, API ,
, ,
.
-
. ,

Python. , -
. , -
.
, ,
.
, -
, -

, -
.
-
In 2006, the We Feel Fine project (http://wefeelfine.org/) scraped a variety of English-language blogs for phrases beginning with "I feel" or "I am feeling".
, .
, -
, - -
-, -
.


-
, -
. -
Python ,
Python.
Python
, .
,
. A Python
3.x, .
Python 2.x Python 3.x, , , -
A.
Python, -
Introducing Python
. , ,
Intoduction to Python -
.
C , -
, -
, .

, -
, -
, , -, HTTP, HTML,
-, , (data
science) .
, .
- -
, , -
.


( ,
).
,
-. ,
,
.
.
, -
. -
-
, ,
.


:

, URL-,
, .

, -
( -
, , ,
, ).

, -
.

,
,
.

.


The code examples for this book are available for download at http://pythonscraping.com/code/.
, -
-. -
-
. ,
-
. ,
, -
. -
- -
OReilly.
, ,
.

.
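We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: "Web Scraping with Python by Ryan Mitchell (O'Reilly). Copyright 2015 Ryan Mitchell, 978-1-491-91029-0."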
, -
,
permissions@oreilly.com.

, -
,

, . OReilly
, -
,
, -
LinkeDrive, , , -
.

, ,
(Allyson MacDonald), (Brian Anderson),
(Miguel Grinberg) (Eric VanWyk) -
,
.

.
(Yale Specht), -
-
, -
.
, .
, (Jim Waldo),
, -, ,
Linux
The Art and Science of C.
I


-
-: Python,
-,
- .
,
, , -
.
, - -
, ,

. , 90% -,
, ,
. ,
(
) -:
HTML- c ;
-
;
;

().
,
, -
. , , ,
, , -
, .
,
.
1

-, -
, cises . , HTML-
, CSS-, JavaScript ,
, , -
, -
.
, GET- -
, HTML-
,
.


-
, -
. , , ,
, http://google.com,
.
, -
, -
, .
-
, ( HTML,
CSS JavaScript), .
, -
, -
. -.
,
. , -
:
1. , -
.

,
( TCP). -
-
IP.
2. -

MAC- IP- .

.
3. , -
/-
.
4. IP-.
5. ( -
80 -, -
, IP-
) -
-.
6.- .
- :
GET-;
: index.html.
7.- HTML-,


.
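Step 6 above has the web server application receive a GET request for index.html. As a rough illustration of what such a request looks like as text on the wire, the sketch below builds a minimal HTTP/1.1 request by hand. The helper name build_get_request and the host example.com are assumptions made for this illustration and do not come from the book; real clients send more headers than this.

```python
def build_get_request(host: str, path: str = "/index.html") -> bytes:
    """Return the raw bytes of a minimal HTTP/1.1 GET request (illustrative helper)."""
    lines = [
        f"GET {path} HTTP/1.1",   # request line: method, requested file, protocol version
        f"Host: {host}",          # which site on the server the request is for
        "Connection: close",      # ask the server to close the connection afterwards
        "",                       # blank line marks the end of the headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


# example.com is a placeholder host used only for this sketch
request = build_get_request("example.com")
print(request.decode("ascii"))
```

Nothing is sent over the network here; the sketch only shows the text that libraries such as urllib assemble on your behalf.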
! .
, ? .

, Nexus 1990 .
, - -
,
, , . -
, , -
, , ,
. - , -
, -
( ) , , -
,
.

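Here is how this is done in Python:

from urllib.request import urlopen

html = urlopen("http://pythonscraping.com/pages/page1.html")
print(html.read())

Save this code as scrapetest.py and run it from the terminal with the command:

$ python scrapetest.py

If Python 2.x is also installed on your machine, you may need to call Python 3.x explicitly by running the script this way:

$ python3 scrapetest.py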
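This prints the complete HTML code of the page at http://bit.ly/1QjYgcd. More precisely, it prints the HTML file page1.html, located in the directory <web root>/pages on the server at http://pythonscraping.com.
What is the difference? Most modern web pages have many resource files associated with them: images, JavaScript files, CSS files, and any other content the page links to. When a web browser encounters a tag such as <img src="cuteKitten.jpg">, it knows that it must make another request to the server for the file cuteKitten.jpg in order to render the page fully. The Python script has no logic for going back and requesting multiple files; it reads only the single HTML file that was requested.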
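To make the difference between a page and its resource files concrete, here is a minimal sketch that lists the extra files a browser would have to request for a small HTML document. It uses only the standard library's html.parser. The class name ResourceCollector, the sample markup, and the file names style.css and app.js are assumptions invented for this illustration; only cuteKitten.jpg appears in the text above.

```python
from html.parser import HTMLParser


class ResourceCollector(HTMLParser):
    """Collect the URLs of files that a browser would request separately."""

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Images and scripts point at their file with src
        if tag in ("img", "script") and attrs.get("src"):
            self.resources.append(attrs["src"])
        # Stylesheets are referenced through <link href="...">
        elif tag == "link" and attrs.get("href"):
            self.resources.append(attrs["href"])


# Hypothetical page used only for this sketch
sample_html = """
<html><head>
<link rel="stylesheet" href="style.css">
<script src="app.js"></script>
</head><body>
<img src="cuteKitten.jpg">
</body></html>
"""

collector = ResourceCollector()
collector.feed(sample_html)
print(collector.resources)  # ['style.css', 'app.js', 'cuteKitten.jpg']
```

A script that calls urlopen once receives only the HTML text; each entry in the list would need its own request.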
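So how does it do this? Consider the line:

from urllib.request import urlopen

It does what it says: it looks in the request module of the urllib library and imports only the urlopen function.

urllib or urllib2?
If you used the urllib2 library in Python 2.x, you may have noticed that things changed somewhat between urllib2 and urllib. In Python 3.x, urllib2 was renamed urllib and split into several submodules: urllib.request, urllib.parse, and urllib.error. Most function names stayed the same, but it is worth noting which functions moved to which submodule when you switch to the new urllib.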
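As a small illustration of how the work is divided among the submodules, the sketch below uses urllib.parse to take a URL apart and confirms how the exception classes in urllib.error relate to each other. The relative link page2.html is a made-up example, not a page known to exist on the site.

```python
from urllib.error import HTTPError, URLError
from urllib.parse import urljoin, urlparse

# urllib.parse: split a URL into its components
parts = urlparse("http://pythonscraping.com/pages/page1.html")
print(parts.netloc)  # pythonscraping.com
print(parts.path)    # /pages/page1.html

# urllib.parse: resolve a relative link against the page it was found on
# (page2.html is a hypothetical link used only for this sketch)
next_url = urljoin("http://pythonscraping.com/pages/page1.html", "page2.html")
print(next_url)      # http://pythonscraping.com/pages/page2.html

# urllib.error: HTTPError is a more specific kind of URLError
print(issubclass(HTTPError, URLError))  # True
```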

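urllib is a standard Python library (so nothing extra needs to be installed to run this example). It contains functions for requesting data across the web, handling cookies, and even changing metadata such as headers and the user agent. urllib is used extensively throughout the book, so it is worth reading the Python documentation for the library (http://bit.ly/1FncvYE).
urlopen opens a remote object across a network and reads it. Because it is fairly generic (it can read HTML files, image files, or any other file stream with ease), it is used frequently throughout the book.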
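The sketch below illustrates the point that urlopen returns a generic file-like object whose read() method yields raw bytes. To keep it runnable without a network connection it opens a data: URL, which urllib.request supports in Python 3.4 and later; the use of a data: URL and the sample markup are assumptions made for this illustration, not something the book does.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Embed a tiny HTML document directly in the URL so no server is needed
url = "data:text/html," + quote("<h1>Hello</h1>")

with urlopen(url) as response:
    body = response.read()  # bytes, exactly as with an http:// URL

print(type(body).__name__)   # bytes
print(body.decode("utf-8"))  # <h1>Hello</h1>
```

The same two calls, urlopen followed by read(), work unchanged for an image or any other file; only the interpretation of the returned bytes differs.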

BeautifulSoup
, ,
!
?
, !
BeautifulSoup -
,
. -
(Mock Turtle)1.
, BeautifulSoup
,
-, HTML-
Python,
XML-.

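Installing BeautifulSoup
BeautifulSoup is not part of the Python standard library, so it must be installed. This book uses BeautifulSoup 4 (also known as BS4). Complete installation instructions for BeautifulSoup 4 are available at Crummy.com; the basic method on Linux is:

$ sudo apt-get install python-bs4

and on a Mac:

$ sudo easy_install pip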

Mock Turtle Soup -, -


1

, ,
.

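This installs the Python package manager pip. Then run the following to install the library:

$ pip install beautifulsoup4

Again, if both Python 2.x and 3.x are installed on your machine, you may need to call python3 explicitly:

$ python3 myScript.py

Do the same when installing packages; otherwise they may be installed under Python 2.x but not Python 3.x:

$ sudo python3 setup.py install

If you use pip, you can also call pip3 to install the Python 3.x versions of packages:

$ pip3 install beautifulsoup4

Installing packages on Windows is nearly identical to the process on Mac and Linux. Download the latest BeautifulSoup 4 release, change to the directory where you unpacked it, and run:

> python setup.py install

That's it! BeautifulSoup is now recognized as a Python library on your machine. To test this, open a Python terminal and import it:

$ python
>>> from bs4 import BeautifulSoup

The import should complete without errors.
There is also an .exe installer for pip on Windows, so packages can be installed and managed just as easily:

> pip install beautifulsoup4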
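When several Python versions are installed, it can be unclear which interpreter a package ended up under. The following sketch, which uses only the standard library, reports which interpreter is running and whether it can see the bs4 module. The helper name is_installed is an assumption made for this illustration.

```python
import importlib.util
import sys


def is_installed(module_name: str) -> bool:
    """Return True if the running interpreter can locate the named module."""
    return importlib.util.find_spec(module_name) is not None


# Show which interpreter is answering, then whether it can find bs4
print("Python", sys.version_info.major, "at", sys.executable)
print("bs4 available:", is_installed("bs4"))
```

Running the script once with python and once with python3 shows whether the library was installed for the interpreter you intend to use.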



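If you work on several Python projects, need an easy way to bundle a project with all of its libraries, or are worried about conflicts between installed libraries, you can install a Python virtual environment to keep everything separate and easy to manage.
When you install a Python library without a virtual environment, you install it globally. This usually requires administrator rights, or running as root, and the library then exists for every user and every project on the machine. Fortunately, creating a virtual environment is easy: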
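$ virtualenv scrapingEnv

This creates a new environment named scrapingEnv, which must be activated before use:

$ cd scrapingEnv/
$ source bin/activate

Once the environment is active, its name appears in the command prompt as a reminder that you are working inside it. Any libraries you install and any scripts you run are confined to that virtual environment.
In the newly created scrapingEnv environment you can install and use BeautifulSoup, for example: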
(scrapingEnv)ryan$ pip install beautifulsoup4
(scrapingEnv)ryan$ python
>>> from bs4 import BeautifulSoup
>>>

The deactivate command leaves the environment, after which the libraries installed inside it are no longer accessible:

(scrapingEnv)ryan$ deactivate
ryan$ python
>>> from bs4 import BeautifulSoup
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named 'bs4'
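Keeping libraries separated by project also makes it easy to zip up the whole environment folder and send it to someone else. As long as they have the same version of Python installed, your code runs from the virtual environment without requiring them to install any libraries themselves.
The examples in this book do not explicitly call for a virtual environment, but you can use one at any time simply by activating it beforehand.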
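A script can check for itself whether it is running inside a virtual environment. The sketch below relies on sys.prefix and sys.base_prefix, which differ when an environment is active; the fallback to sys.real_prefix covers older virtualenv releases. The function name in_virtualenv is an assumption made for this illustration.

```python
import sys


def in_virtualenv() -> bool:
    """Return True if the running interpreter belongs to a virtual environment."""
    # venv and recent virtualenv releases expose the original installation as
    # sys.base_prefix; older virtualenv releases used sys.real_prefix instead.
    base = getattr(sys, "real_prefix", None) or getattr(sys, "base_prefix", sys.prefix)
    return base != sys.prefix


print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtualenv())
```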

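Running BeautifulSoup
The most commonly used object in the BeautifulSoup library is, appropriately, the BeautifulSoup object. Here it is in action, in a modified version of the example from the beginning of this chapter: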

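from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page1.html")
# Name the parser explicitly so the result does not depend on which parsers are installed
bsObj = BeautifulSoup(html.read(), "html.parser")
print(bsObj.h1)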
The output is:

<h1>An Interesting Title</h1>

As in the earlier example, urlopen is imported and html.read() is called to get the HTML content of the page. That content is then transformed into a BeautifulSoup object with the following structure:

html → <html><head>...</head><body>...</body></html>
    head → <head><title>A Useful Page</title></head>
        title → <title>A Useful Page</title>
    body → <body><h1>An Int...</h1><div>Lorem ip...</div></body>
        h1 → <h1>An Interesting Title</h1>
        div → <div>Lorem Ipsum dolor...</div>
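Note that the <h1> tag extracted from the page is nested two levels deep in the BeautifulSoup object structure (html → body → h1). When fetching it from the object, however, the h1 tag is called directly:

bsObj.h1

In fact, any of the following calls produces the same output:

bsObj.html.body.h1
bsObj.body.h1
bsObj.html.h1

This small taste of BeautifulSoup should give an idea of the power and simplicity of the library. Virtually any information can be extracted from any HTML (or XML) file, as long as some identifying tag surrounds it or sits near it. Chapter 3 looks at more complex BeautifulSoup function calls, as well as regular expressions and how they can be used with BeautifulSoup to extract information from websites.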


. , - ,
. , -
-, ,
,
, , , ,
-
, . -
, , -
( ),
, !
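Take the first line of the scraper after the import statements and consider how to handle any exceptions it might raise:

html = urlopen("http://www.pythonscraping.com/pages/page1.html")

Two main things can go wrong in this line:
the page is not found on the server (or an error occurred while retrieving it);
the server is not found.
In the first case, an HTTP error is returned. This may be "404 Page Not Found", "500 Internal Server Error", and so on. In all of these cases, urlopen raises the generic exception HTTPError, which can be handled as follows: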
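from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen("http://www.pythonscraping.com/pages/page1.html")
except HTTPError as e:
    print(e)
    # return null, break, or do some other "Plan B"
else:
    # The program continues. Note: if you return or break in the
    # exception handler, the "else" statement is not needed.
    print(html.read())

If an HTTP error code is returned, the program now prints the error and does not execute the rest of the code under the else statement.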
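If the server is not found at all (for example, http://www.pythonscraping.com is down or the URL is mistyped), urlopen does not return None; it raises a URLError. Because HTTPError is a subclass of URLError, catch the more specific exception first. URLError is imported from urllib.error, just like HTTPError:

try:
    html = urlopen("http://www.pythonscraping.com/pages/page1.html")
except HTTPError as e:
    print(e)
except URLError as e:
    print("The server could not be found")
else:
    print("It worked!")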
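In current Python 3 releases, a missing page surfaces as an HTTPError and an unreachable server as a URLError. The sketch below shows one way to tell the two apart. So that it runs without a network connection, the function accepts an opener argument and is exercised with two stand-in openers that simulate each failure. The names fetch, fake_404, and fake_unreachable, and the opener parameter itself, are assumptions made for this illustration.

```python
from email.message import Message
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch(url, opener=urlopen):
    """Return the response body as bytes, or None if the request failed."""
    try:
        response = opener(url)
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500
        print(f"HTTP error {e.code} for {url}")
        return None
    except URLError as e:
        # No usable answer at all: server down, DNS failure, mistyped host
        print(f"Could not reach the server for {url}: {e.reason}")
        return None
    return response.read()


def fake_404(url):
    """Stand-in opener that simulates a missing page."""
    raise HTTPError(url, 404, "Not Found", Message(), None)


def fake_unreachable(url):
    """Stand-in opener that simulates an unreachable server."""
    raise URLError("name resolution failed")


page = "http://www.pythonscraping.com/pages/page1.html"
print(fetch(page, opener=fake_404))          # None
print(fetch(page, opener=fake_unreachable))  # None
```

In real use the opener argument is simply left at its default, urlopen.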
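Of course, even if the page is retrieved successfully from the server, its content may not be quite what you expected. Every time you access a tag in a BeautifulSoup object, it is wise to check that the tag actually exists. If you try to access a tag that does not exist, BeautifulSoup returns None. The problem is that attempting to access a tag on None itself raises an AttributeError.
The following line (where nonExistentTag is a made-up tag, not the name of a real BeautifulSoup function):

print(bsObj.nonExistentTag)

returns None. This object is perfectly reasonable to handle and check for. The trouble comes if you do not check for it and instead go on to access another tag on the None object:

print(bsObj.nonExistentTag.someTag)

which raises the exception:

AttributeError: 'NoneType' object has no attribute 'someTag'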
So how can you guard against these two situations? The easiest way is to check for both explicitly:

try:
    badContent = bsObj.nonExistentTag.anotherTag
except AttributeError as e:
    print("Tag was not found")
else:
    if badContent is None:
        print("Tag was not found")
    else:
        print(badContent)
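An alternative to wrapping every chained access in try/except is a small helper that walks the chain one step at a time and stops as soon as a step yields None. The sketch below is one possible shape for such a helper. The name get_nested is an assumption made for this illustration, and the page object is a plain stand-in built with SimpleNamespace rather than a real BeautifulSoup object, so that the example runs without any third-party library.

```python
from types import SimpleNamespace


def get_nested(obj, *names):
    """Follow attribute names in order; return None as soon as one is missing."""
    for name in names:
        if obj is None:
            return None
        obj = getattr(obj, name, None)
    return obj


# Stand-in for a parsed page: it has body.h1 but no nonExistentTag
page = SimpleNamespace(
    body=SimpleNamespace(h1="An Interesting Title"),
    nonExistentTag=None,
)

print(get_nested(page, "body", "h1"))                # An Interesting Title
print(get_nested(page, "nonExistentTag", "someTag")) # None
```

The caller then needs only a single check for None, whichever step of the chain was missing.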

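Checking and handling every error may seem laborious at first, but a little reorganization makes the code easier to write and, more importantly, much easier to read. Here, for example, is the same scraper written in a slightly different way:

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup


def getTitle(url):
    try:
        html = urlopen(url)
    except (HTTPError, URLError) as e:
        # The page or the server could not be reached
        return None
    try:
        bsObj = BeautifulSoup(html.read(), "html.parser")
        title = bsObj.body.h1
    except AttributeError as e:
        # For example, the page has no <body>
        return None
    return title


title = getTitle("http://www.pythonscraping.com/pages/page1.html")
if title is None:
    print("Title could not be found")
else:
    print(title)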
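This example defines a function, getTitle, that returns either the title of the page or None if there was a problem retrieving it. Inside getTitle, an HTTPError is checked for as in the previous example, and the two BeautifulSoup lines are wrapped in a single try statement. An AttributeError may be raised from these lines (for example, if the page has no <body> tag, bsObj.body is None and bsObj.body.h1 raises an AttributeError). In fact, a single try statement can enclose as many lines as you like, or call another function entirely, which may raise an AttributeError at any point.
When writing scrapers, think about the overall pattern of the code so that it handles exceptions and stays readable at the same time. You will also likely want to reuse code heavily. Generic functions such as getSiteHTML and getTitle (complete with thorough exception handling) make it easy to scrape the web quickly and reliably.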
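The text names getSiteHTML but does not show it, so the following is only a guess at how such a pair of reusable functions might be split. To keep the sketch free of third-party dependencies and runnable offline, the title extraction uses a small html.parser subclass as a stand-in for BeautifulSoup, and the demonstration pages are data: URLs. The function bodies, the _FirstH1 class, and the sample markup are all assumptions made for this illustration.

```python
from html.parser import HTMLParser
from urllib.error import HTTPError, URLError
from urllib.parse import quote
from urllib.request import urlopen


def getSiteHTML(url):
    """Return the page body as text, or None if it cannot be retrieved."""
    try:
        with urlopen(url) as response:
            return response.read().decode("utf-8", errors="replace")
    except (HTTPError, URLError):
        return None


class _FirstH1(HTMLParser):
    """Standard-library stand-in for BeautifulSoup: keeps the first <h1> text."""

    def __init__(self):
        super().__init__()
        self.title = None
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and self.title is None:
            self._inside = True
            self.title = ""

    def handle_data(self, data):
        if self._inside:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "h1":
            self._inside = False


def getTitle(url):
    """Return the text of the first <h1> on the page, or None."""
    html = getSiteHTML(url)
    if html is None:
        return None
    parser = _FirstH1()
    parser.feed(html)
    return parser.title


# Hypothetical pages embedded in data: URLs so the sketch runs offline
with_title = "data:text/html," + quote("<html><body><h1>An Interesting Title</h1></body></html>")
without_title = "data:text/html," + quote("<html><body><p>No heading</p></body></html>")

print(getTitle(with_title))     # An Interesting Title
print(getTitle(without_title))  # None
```

Keeping retrieval and parsing in separate functions means the error handling for each concern is written once and reused by every scraper that calls them.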