##############################################################################
# This example is pretty much entirely based on this excellent blog post
# http://glowingpython.blogspot.in/2014/09/text-summarization-with-nltk.html
# Thanks to TheGlowingPython, the good soul that wrote this excellent article!
##############################################################################
##############################################################################
# to use stopwords, you need to have run nltk.download() first - one-off setup
##############################################################################
import nltk
nltk.download('stopwords')
##############################################################################
# We have used dictionaries so far, but now that we have covered classes, this is a
# good time to introduce defaultdict. This is a class that inherits from dict, but has
# one additional nice feature: usually, a Python dictionary throws a KeyError if you
# try to access a key that does not exist. The defaultdict, in contrast, will simply
# create any item that you try to access (provided of course it does not exist yet).
# To create such a "default" item, it relies on a factory function you pass in when
# you construct it.
##############################################################################
# imports needed below: tokenizers and stopwords from nltk, defaultdict from
# collections, punctuation symbols from string, and nlargest from heapq
##############################################################################
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from collections import defaultdict
from string import punctuation
from heapq import nlargest
##############################################################################
class FrequencySummarizer:
    # The constructor, __init__, is the special member function that runs when an
    # object of this class is instantiated.
    # btw, note how the special keyword 'self' is passed in as the first argument
    def __init__(self, min_cut=0.1, max_cut=0.9):
        self._min_cut = min_cut
        self._max_cut = max_cut
        # Punctuation symbols and stopwords (common words like 'an','the' etc) are ignored
        self._stopwords = set(stopwords.words('english') + list(punctuation))
        # Member variables are prefixed with 'self', i.e. each object (instance) of this
        # class will have an independent copy of these variables.
        # Note how this function is used to set up the member variables to their
        # appropriate initial values.
    # indentation changes - we are out of the constructor (a member function), but we
    # are still inside the class.
    # One important note if you are used to programming in Java or C#: if you define a
    # variable here, i.e. outside a member function but inside the class, it becomes a
    # STATIC member variable, shared by all instances. This is an important difference
    # from Java and C# (where all member variables would be defined here).

    # next method (member function), which takes in self (the special keyword for this
    # same object) as well as a list of tokenized sentences, and outputs a dictionary
    # where the keys are words and the values are their (normalized) frequencies
    def _compute_frequencies(self, word_sent):
        freq = defaultdict(int)
        # freq behaves just like a regular dictionary, with one difference: usually, a
        # Python dictionary throws a KeyError if you try to access a key that does not
        # exist. The defaultdict, in contrast, will simply create any items that you
        # try to access (provided of course they do not exist yet). The 'int' passed in
        # as argument tells it to create a default value of 0 for any missing key.
        for s in word_sent:
            # indentation changes - we are inside the for loop, once per sentence
            for word in s:
                # indentation changes again - this is an inner for loop, run once per
                # word in that sentence.
                # if the word is in the member variable (set) self._stopwords, then
                # ignore it, else increment its frequency. Had freq been a regular
                # dictionary (not a defaultdict), we would have had to first check
                # whether this word is in the dict before incrementing.
                if word not in self._stopwords:
                    freq[word] += 1
        # Done with the frequency calculation - now go through our frequency list and
        # do 2 things:
        #   normalize the frequencies by dividing each by the highest frequency (this
        #   allows us to always have frequencies between 0 and 1, which makes comparing
        #   them easy)
        #   filter out frequencies that are too high or too low. A trick that yields
        #   better results.
        m = float(max(freq.values()))
        for w in list(freq.keys()):
            freq[w] = freq[w]/m
            if freq[w] >= self._max_cut or freq[w] <= self._min_cut:
                # indentation changes - we are inside the if statement - if we are here
                # the word is either really common or really uncommon. In either case,
                # delete it from our dictionary.
                # remember that del can be used to remove a key-value pair from the
                # dictionary
                del freq[w]
        return freq
    # next method (member function), which takes in self (the special keyword for this
    # same object) as well as the raw text, and the number of sentences we wish the
    # summary to contain. Returns the summary.
    def summarize(self, text, n):
        sents = sent_tokenize(text)
        # assert is a way of making sure a condition holds true, else an exception is
        # thrown. Used to do sanity checks like making sure the summary is shorter
        # than the original article.
        assert n <= len(sents)
        # splits each sentence into words, then takes all of those lists (1 per
        # sentence) and puts them into one list of lists
        word_sent = [word_tokenize(s.lower()) for s in sents]
        # make a call to the method (member function) _compute_frequencies, and place
        # the result in the member variable self._freq
        self._freq = self._compute_frequencies(word_sent)
        # create an empty dictionary (of the superior defaultdict variety) to hold the
        # rankings of the sentences
        ranking = defaultdict(int)
        for i, sent in enumerate(word_sent):
            # Indentation changes - we are inside the for loop. Oh! and this is a
            # different type of for loop: a new built-in function, enumerate(), will
            # make certain loops a bit clearer. enumerate(sequence) will return
            # (0, sequence[0]), (1, sequence[1]), (2, sequence[2]), and so forth.
            # A loop like:
            #     for i in range(len(L)):
            #         item = L[i]
            #         L[i] = result
            # can be rewritten using enumerate() as:
            #     for i, item in enumerate(L):
            #         L[i] = result
            for w in sent:
                if w in self._freq:
                    # if this is not a stopword (common word), add the frequency of
                    # that word to the ranking of the sentence it appears in
                    ranking[i] += self._freq[w]
        # OK - we are outside the for loop and now have rankings for all the sentences.
        # We want to return the n sentences with the highest ranking; use the nlargest
        # function to do so. This function needs to know how to get the list of values
        # to rank, so we give it a function - simply the get method of the ranking
        # dictionary.
        sents_idx = nlargest(n, ranking, key=ranking.get)
        return [sents[j] for j in sents_idx]
##############################################################################
import urllib.request
from bs4 import BeautifulSoup
##############################################################################
# Introducing Beautiful Soup: "Beautiful Soup is a Python library for pulling data out
# of HTML and XML files. It works with your favorite parser to provide idiomatic ways
# of navigating, searching, and modifying the parse tree. It commonly saves programmers
# hours or days of work."
##############################################################################
def get_only_text_washington_post_url(url):
    # This function takes in a URL as an argument, and returns only the text of the
    # article in that URL.
    page = urllib.request.urlopen(url).read().decode('utf8')
    soup = BeautifulSoup(page, 'html.parser')
    # use this code to get everything in that page that lies between a pair of
    # <article> and </article> tags. We do this because we know that in the URLs we are
    # currently interested in, the article body is wrapped in an <article> tag.
    text = ' '.join(map(lambda p: p.text, soup.find_all('article')))
    # OK - we got everything between the <article> and </article> tags, but that
    # everything still contains a lot of markup besides the article text.
    # Now - repeat, but this time we will only take what lies between <p> and </p> tags;
    # these are HTML tags for "paragraph", i.e. this should give us the actual text of
    # the article.
    soup2 = BeautifulSoup(text, 'html.parser')
    # use this code to get everything in that text that lies between a pair of
    # <p> and </p> tags
    text = ' '.join(map(lambda p: p.text, soup2.find_all('p')))
    # Btw note that BeautifulSoup returns the title without our doing anything special -
    # this is why BeautifulSoup works so much better than, say, regular expressions at
    # parsing HTML
    return soup.title.text, text
##############################################################################
someUrl = "https://www.washingtonpost.com/news/the-switch/wp/2015/08/06/why-kids-are-meeting-more-strangers-online-than-ever-before/"
textOfUrl = get_only_text_washington_post_url(someUrl)
fs = FrequencySummarizer()
summary = fs.summarize(textOfUrl[1], 3)
print(summary)