
Workshop 2b: Web Scraping with BeautifulSoup 4

COMP20008 Elements of Data Processing

0.1 Workshop 2b - Web Scraping with BeautifulSoup

During this task you will learn to scrape the web using BeautifulSoup4 and save the results to a .csv file. The typical procedure is:
1. Define the task
2. Browse the website & determine the contents of interest
3. Identify the structure & write a script to extract one entity
4. Write an extraction function & automate the process
5. Extract and save/archive the data

Your task is to determine and plot the number of books published per year by O'Reilly about Web Development. First, browse
http://shop.oreilly.com and go to the Web Development section. Here are a few things to look out for:

Is the information you need displayed on the page? If the required information is not displayed on the page, you most likely
won't be able to scrape it (sometimes information is hidden from the user in the form of HTML comments, but that is not the case
here).

How many items are displayed per page? Are all the items books? In our case we are only interested in the first 30 books.
However, if we wanted to scrape the entire website, for all categories, we would need a different stopping criterion as well (how many
books are there?). Now, pay attention to the URL and browse different pages.

Does the URL change? How many pages are there? Using this information you should infer what needs to change in the URL
to crawl the entire target category.
Is scraping allowed? Whenever you want to scrape data from a website, you should first check whether it has some sort of access
policy (e.g. http://oreilly.com/terms/). It is best to check for a robots.txt file that tells web crawlers how to behave. (Take a look at
eBay's robots.txt for a restrictive case.) The important lines in O'Reilly's robots.txt are:

Crawl-delay: 30
Request-rate: 1/30

The first tells us that we should wait 30 seconds between requests; the second that we should request only one page every 30 seconds.
These are two different ways of saying the same thing. Not following these terms will lead to our crawlers being banned!
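
If you want to check a site's policy programmatically, Python's built-in urllib.robotparser can parse robots.txt for you. A minimal sketch (the category URL below is illustrative, not necessarily the real one):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("http://shop.oreilly.com/robots.txt")
rp.read()
# Check whether a generic crawler may fetch a page (this URL is a placeholder)
print(rp.can_fetch("*", "http://shop.oreilly.com/category/browse-subjects/web-development.do"))
# Crawl-delay for our user agent, if the file specifies one (Python 3.6+)
print(rp.crawl_delay("*"))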

Task 1: Scraping from the web and plotting


Now that you have the URL, have a look at the page source to figure out what data you need to extract. For each step, write your
code in the empty cell below it.
In [ ]: from bs4 import BeautifulSoup as bsoup
import requests, html5lib

base_url = ""  # fill in the URL of the Web Development category
soup = bsoup(requests.get(base_url).text, "html5lib")  # the parser name must be a string
1. Find all of the relevant <td> tag elements.
In [ ]: tds = soup("td", "figure_out_class_keyword")  # replace with the class you find in the page source
print(len(tds))
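Calling the soup object directly is shorthand for find_all, and a string passed as the second positional argument filters by CSS class; the two lines below are equivalent (the class name is still the placeholder you need to replace):

tds = soup("td", "figure_out_class_keyword")
tds = soup.find_all("td", class_="figure_out_class_keyword")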
2. Write a function that filters out videos.
In [ ]: def is_video(td):
    """
    It's a video if it has exactly one pricelabel, and
    the stripped text inside that pricelabel starts with "Video"
    """
    # your code here:
    return bool()

print(len([td for td in tds if not is_video(td)]))
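One possible implementation, assuming the price sits in a <span> with class pricelabel (verify the class name against the page source):

def is_video(td):
    """
    It's a video if it has exactly one pricelabel, and the
    stripped text inside that pricelabel starts with "Video"
    """
    pricelabels = td("span", "pricelabel")  # "pricelabel" is an assumed class name
    return (len(pricelabels) == 1 and
            pricelabels[0].text.strip().startswith("Video"))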
3. Write a function that, given a BeautifulSoup tag representing a book, returns a dict with the title, authors, price and date.
In [ ]: import re  # regex might prove useful

def book_info(td):
    """
    Given a BeautifulSoup <td> Tag representing a book,
    extract the book's details and return a dict
    """
    # your code here:
    return {
        "title": title,
        "authors": authors,
        "price": price,
        "date": date,
    }

print(book_info(tds[0]))
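A sketch of one way to fill this in. All four class names (thumbheader, AuthorName, pricelabel, directorydate) are assumptions about the page's markup; confirm each one in the page source before relying on it:

def book_info(td):
    # Class names below are assumed -- check them against the actual HTML
    title = td.find("div", "thumbheader").a.text
    by_author = td.find("div", "AuthorName").text           # e.g. "By Jane Doe, John Roe"
    authors = [name.strip() for name in re.sub(r"^By ", "", by_author).split(",")]
    price = td.find("span", "pricelabel").text.strip()
    date = td.find("span", "directorydate").text.strip()    # e.g. "November 2014"
    return {"title": title, "authors": authors, "price": price, "date": date}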
4. Scrape the website for all the books.
In [ ]: from time import sleep

books = []
num_pages = 5  # should be able to find 30 books in 5 pages
base_url = ""
for page_num in range(1, num_pages + 1):
    # your code here:
    print("Scraping page", page_num, ",", len(books), "books found so far")
    # now we wait as requested in robots.txt
    sleep(30)
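A possible loop body, assuming the page number is simply appended to base_url (check how the URL actually changes as you browse; this pattern is an assumption):

for page_num in range(1, num_pages + 1):
    url = base_url + str(page_num)                       # assumed URL pattern
    soup = bsoup(requests.get(url).text, "html5lib")
    for td in soup("td", "figure_out_class_keyword"):    # same class as in step 1
        if not is_video(td):
            books.append(book_info(td))
    print("Scraping page", page_num, ",", len(books), "books found so far")
    sleep(30)  # wait as requested in robots.txt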
The previous code block might take a while to run (5 pages with a 30-second delay is about 2.5 minutes).

Wait for it to finish, then print the first 5 book titles to check:

In [ ]: print([book["title"] for book in books[:5]])
5. Now that you have collected the data, plot the number of books published each year.
In [ ]: def get_year(book):
    """book["date"] looks like "November 2014", so we need to
    split on the space and then take the second piece"""
    return int(book["date"].split()[1])

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

bookyears = [get_year(book) for book in books]
sns.distplot(bookyears)
plt.ylabel("Number of books")
plt.title("Web Dev!!")
plt.show()
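Note that sns.distplot draws a normalized histogram with a density curve by default, so the y-axis is not literally a book count. If you want raw counts per year, one option is a bar chart built from a Counter:

from collections import Counter

year_counts = Counter(bookyears)
years = sorted(year_counts)
plt.bar(years, [year_counts[y] for y in years])
plt.ylabel("Number of books")
plt.title("Web Dev!!")
plt.show()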
Saving the books to a CSV file:
In [ ]: import csv

try:
    # newline="" avoids blank lines between rows on some platforms
    with open("books.csv", "w", newline="") as output_file:
        dict_writer = csv.DictWriter(output_file, fieldnames=books[0].keys())
        dict_writer.writeheader()
        dict_writer.writerows(books)
except IOError as err:
    print("I/O error({0}): {1}".format(err.errno, err.strerror))
