
Introduction to Python Part 4: NLTK and other cool Python stuff




Alastair Burt, Andreas Eisele, Christian Federmann, Torsten Marek, Ulrich Schäfer
DFKI & Universität des Saarlandes

October 8th, 2009

Outline

Today's Topics:

1 NLTK, the Natural Language Toolkit
  - Overview
  - Tokenization and PoS Tagging
  - Morphology and Feature Structures
  - Accessing Corpora
  - Chunking and CFG Parsing
  - Classification and Clustering

2 Other cool Python stuff
  - Building GUIs with TK (Tkinter)
  - UNO: (Python-)programming OpenOffice
  - Zope & Plone
  - Google App Engine

3 Summary and some thoughts


What is NLTK? · Installation · NLP Pipeline

Part I NLTK Overview

What is NLTK?

NLTK Natural Language Toolkit


- Developed by Steven Bird, Ewan Klein and Edward Loper
- Mainly addresses education and research
- Very good documentation, book online: http://www.nltk.org/book
- NLTK examples on slides taken from there
- We only present very simple approaches to NLP here, often based on regular expressions. More sophisticated approaches are discussed in the book.

Installation

Required packages:

1 Python version 2.4, 2.5 or 2.6
2 NLTK source distribution: http://nltk.googlecode.com/files/nltk-2.0b6.zip
3 or NLTK Debian package: http://nltk.googlecode.com/files/nltk_2.0b5-1_all.deb
4 NLTK data (corpora etc.): see http://www.nltk.org/data
5 further packages and installation instructions: http://www.nltk.org/download

The NLP Analysis Pipeline

From Text Strings to Text Understanding:

1 Tokenization
2 Morphological analysis
3 Part-of-Speech tagging
4 Named entity recognition
5 Chunking (chunk parsing)
6 Sentence boundary detection
7 Syntactic parsing
8 Anaphora resolution
9 Semantic analysis
10 Pragmatics, Reasoning, ...


Tokenization and PoS Tagging · Debugging Regular Expressions · Regular Expression Tokenizer II · Regular Expression Tagger

Part III Tokenization and PoS Tagging

Regular Expression Tokenizer

Split the input string into a list of words (remember re.findall()).

regexptokenizer1.py: Simple tokenizer

import nltk
text = "Hello. Isn't this fun?"
pattern = r"\w+|[^\w\s]+"
print nltk.tokenize.regexp_tokenize(text, pattern)

Result:
['Hello', '.', 'Isn', "'", 't', 'this', 'fun', '?']
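If NLTK is not at hand, the same behaviour can be sketched with the standard library alone, since nltk.tokenize.regexp_tokenize is essentially re.findall applied to the text:

```python
import re

def regexp_tokenize(text, pattern):
    # stand-in for nltk.tokenize.regexp_tokenize:
    # return all non-overlapping matches of pattern, left to right
    return re.findall(pattern, text)

print(regexp_tokenize("Hello. Isn't this fun?", r"\w+|[^\w\s]+"))
# -> ['Hello', '.', 'Isn', "'", 't', 'this', 'fun', '?']
```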

Debugging Regular Expressions

NLTK tool for debugging regular expressions:

import nltk
nltk.re_show(pattern, inputstring)

Example:

import nltk
nltk.re_show('o+', 'Computational linguistics is cool.')

Result:
C{o}mputati{o}nal linguistics is c{oo}l.
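The marking behaviour of nltk.re_show can be imitated in one line with re.sub; this sketch also returns the marked string so it can be inspected:

```python
import re

def re_show(pattern, string):
    # wrap every match of pattern in {...}, like nltk.re_show
    marked = re.sub(pattern, lambda m: "{%s}" % m.group(), string)
    print(marked)
    return marked

re_show('o+', 'Computational linguistics is cool.')
# prints: C{o}mputati{o}nal linguistics is c{oo}l.
```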

Regular Expression Tokenizer II

Tokenize currency amounts and abbreviations correctly.

regexptokenizer2.py: Extended simple tokenizer

import nltk
text = 'That poster costs $22.40.'
pattern = r'''(?x)      # verbose regexps: whitespace and comments allowed
      \w+               # sequences of word characters
    | \$?\d+(\.\d+)?    # currency amounts, e.g. $12.50
    | ([A-Z]\.)+        # abbreviations, e.g. U.S.A.
    | [^\w\s]+          # sequences of punctuation
'''
nltk.tokenize.regexp_tokenize(text, pattern)

Result:
['That', 'poster', 'costs', '$22.40', '.']
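A plain re.findall version of this tokenizer needs two adjustments: the capturing groups must become non-capturing (?:...), because findall otherwise returns the groups rather than the matches, and the currency and abbreviation alternatives must precede \w+ so that e.g. U.S.A. is not split after its first letter. A sketch:

```python
import re

pattern = r'''(?x)          # verbose mode: whitespace and comments ignored
      \$?\d+(?:\.\d+)?      # currency amounts, e.g. $12.50
    | (?:[A-Z]\.)+          # abbreviations, e.g. U.S.A.
    | \w+                   # sequences of word characters
    | [^\w\s]+              # sequences of punctuation
'''
print(re.findall(pattern, 'That poster costs $22.40.'))
# -> ['That', 'poster', 'costs', '$22.40', '.']
```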

Regular Expression Tagger

Part-of-Speech tagging assigns a word class to each input token.

regexptagger.py: Simple Tagger for English

import nltk

patterns = [
    (r'.*ing$', 'VBG'),                # gerunds
    (r'.*ed$', 'VBD'),                 # simple past
    (r'.*es$', 'VBZ'),                 # 3rd singular present
    (r'.*ould$', 'MD'),                # modals
    (r".*'s$", 'NN$'),                 # possessive nouns
    (r'.*s$', 'NNS'),                  # plural nouns
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
    (r'.*', 'NN')                      # nouns (default)
]
regexp_tagger = nltk.RegexpTagger(patterns)
regexp_tagger.tag(nltk.corpus.brown.sents(categories='adventure')[3])

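The fallback logic of RegexpTagger (the first matching pattern wins, with a catch-all at the end) can be sketched in pure Python; this is a toy stand-in, not NLTK's implementation:

```python
import re

patterns = [
    (r'.*ing$', 'VBG'),  # gerunds
    (r'.*ed$', 'VBD'),   # simple past
    (r'.*es$', 'VBZ'),   # 3rd singular present
    (r'.*ould$', 'MD'),  # modals
    (r'.*s$', 'NNS'),    # plural nouns
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
    (r'.*', 'NN'),       # nouns (default)
]

def tag(tokens):
    # pair each token with the tag of the first pattern it matches
    return [(tok, next(t for p, t in patterns if re.match(p, tok)))
            for tok in tokens]

print(tag(['She', 'was', 'walking', 'home']))
# -> [('She', 'NN'), ('was', 'NNS'), ('walking', 'VBG'), ('home', 'NN')]
```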

Porter Stemmer

Part IV Morphology/Stemmer

Morphology/Stemmer

porter.py: Simple Stemmer (Porter)

import nltk
stemmer = nltk.PorterStemmer()
verbs = ['appears', 'appear', 'appeared', 'calling', 'called']
stems = []
for verb in verbs:
    stemmed_verb = stemmer.stem(verb)
    stems.append(stemmed_verb)
sorted(set(stems))

Result:
['appear', 'call']
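Porter's algorithm applies several ordered phases of rewrite rules; the basic suffix-stripping idea behind it can be sketched in a few lines (a toy, not the real PorterStemmer):

```python
def toy_stem(word):
    # strip one common suffix, longest first, keeping a stem of >= 3 letters
    for suffix in ('ing', 'ed', 'es', 's'):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

verbs = ['appears', 'appear', 'appeared', 'calling', 'called']
print(sorted(set(toy_stem(v) for v in verbs)))
# -> ['appear', 'call']
```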


Corpora included · Gutenberg Statistics · UDHR word length distribution · Concordance

Part V Accessing Corpora

Corpora included in NLTK

- Gutenberg Texts (subset)
- Brown Corpus
- UDHR
- CMU Pronunciation Dictionary
- TIMIT
- Senseval 2
- ... and many more

Example:

import nltk
nltk.corpus.brown.words()
nltk.corpus.gutenberg.fileids()


Gutenberg Corpus Statistics

The code displays corpus filename, average word length, average sentence length, and the number of times each vocabulary item appears in the text on average.

gutenberg.py: Compute simple corpus statistics

from nltk.corpus import gutenberg
for filename in gutenberg.fileids():
    r = gutenberg.raw(filename)
    w = gutenberg.words(filename)
    s = gutenberg.sents(filename)
    v = set(w)
    print filename, len(r)/len(w), len(w)/len(s), len(w)/len(v)

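The three ratios can be computed for any plain string without NLTK; a sketch with crude whitespace tokenization and a hypothetical two-sentence text (note that the slide's Python 2 integer divisions truncate, while true division as below gives exact averages):

```python
def corpus_stats(raw):
    # average word length, average sentence length, lexical diversity
    words = raw.split()                               # crude tokenization
    sents = [s for s in raw.split('.') if s.strip()]  # crude sentence split
    vocab = set(words)
    return (len(raw) / len(words),    # chars per word (incl. whitespace)
            len(words) / len(sents),  # words per sentence
            len(words) / len(vocab))  # tokens per vocabulary item

stats = corpus_stats("the cat sat. the dog sat.")
print(stats[1])  # -> 3.0 words per sentence
```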

UDHR corpus

Contains the Universal Declaration of Human Rights in over 300 languages. The example requires matplotlib (see the NLTK download page on how to get it).

udhr.py: Compute and display word length distribution

import nltk, pylab

def cld(lang):
    text = nltk.corpus.udhr.words(lang)
    fd = nltk.FreqDist(len(token) for token in text)
    ld = [100*fd.freq(i) for i in range(36)]
    return [sum(ld[0:i+1]) for i in range(len(ld))]

langs = ['Chickasaw-Latin1', 'English-Latin1', 'German_Deutsch-Latin1',
         'Greenlandic_Inuktikut-Latin1', 'Hungarian_Magyar-Latin1',
         'Ibibio_Efik-Latin1']
dists = [pylab.plot(cld(l), label=l[:-7], linewidth=2) for l in langs]
pylab.title('Cumulative Word Length Distrib. for Several Languages')
pylab.legend(loc='lower right')
pylab.show()
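nltk.FreqDist is in essence a collections.Counter, so the cumulative length distribution can be sketched with the standard library on a toy word list standing in for the UDHR corpus:

```python
from collections import Counter

def cumulative_length_dist(words, max_len=10):
    # percentage of tokens of length <= i, for i = 0 .. max_len-1
    fd = Counter(len(w) for w in words)
    total = sum(fd.values())
    percents = [100 * fd[i] / total for i in range(max_len)]
    return [sum(percents[:i + 1]) for i in range(max_len)]

dist = cumulative_length_dist(['all', 'human', 'beings', 'are', 'born', 'free'])
print(round(dist[-1]))  # -> 100 (every token is shorter than max_len)
```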


Concordance

concordance.py: Concordance on Brown Corpus

import nltk

def concordance(word, context):
    "Generate a concordance for the word with the specified context window"
    for sent in nltk.corpus.brown.sents(categories='a'):
        try:
            pos = sent.index(word)
            left = ' '.join(sent[:pos])
            right = ' '.join(sent[pos+1:])
            print '%*s %s %-*s' % \
                (context, left[-context:], word, context, right[:context])
        except ValueError:
            pass

concordance('line', 32)
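The alignment relies on the %*s format specifier, which reads the field width from the argument list. A corpus-free sketch of the same function over a list of pre-tokenized sentences:

```python
def concordance(word, context, sents):
    # return one aligned line per sentence that contains word
    lines = []
    for sent in sents:
        try:
            pos = sent.index(word)
        except ValueError:
            continue  # word not in this sentence
        left = ' '.join(sent[:pos])
        right = ' '.join(sent[pos + 1:])
        lines.append('%*s %s %-*s' % (context, left[-context:],
                                      word, context, right[:context]))
    return lines

sents = [['the', 'line', 'was', 'long'], ['no', 'match', 'here']]
print(concordance('line', 10, sents))
# -> ['       the line was long  ']
```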


Chunker · CFG Parser · Earley Parser · Classification and Clustering

Part VI Chunking and CFG Parsing

Chunker

chunkparser.py: Chunk parser

import nltk

grammar = r"""
NP: {<DT|PP\$>?<JJ>*<NN>}  # chunk determiner/possessive,
                           # adjectives and nouns
    {<NNP>+}               # chunk sequences of proper nouns
"""
cp = nltk.RegexpParser(grammar)
tagged_tokens = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
                 ("her", "PP$"), ("golden", "JJ"), ("hair", "NN")]
print cp.parse(tagged_tokens)
cp.parse(tagged_tokens).draw()
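Under the hood, a regexp chunker matches patterns over the sequence of tags rather than the words. The idea can be sketched with re by encoding the tag sequence as a string (a toy NP chunker, not NLTK's RegexpParser):

```python
import re

def chunk_np(tagged_tokens):
    # encode the tag sequence as '<TAG><TAG>...' and find NP spans
    tags = ''.join('<%s>' % t for _, t in tagged_tokens)
    spans = []
    for m in re.finditer(r'(?:<DT>|<PP\$>)?(?:<JJ>)*<NN>|(?:<NNP>)+', tags):
        # convert character offsets back to token indices
        start = tags[:m.start()].count('<')
        end = start + m.group().count('<')
        spans.append([w for w, _ in tagged_tokens[start:end]])
    return spans

tagged = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
          ("her", "PP$"), ("golden", "JJ"), ("hair", "NN")]
print(chunk_np(tagged))
# -> [['Rapunzel'], ['her', 'golden', 'hair']]
```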

CFG Parser

Various parsing algorithms are implemented within NLTK and explained in detail in the NLTK book.

cfgparser.py: CFG parser

import nltk
grammar = nltk.parse_cfg("""
  S -> NP VP
  VP -> V NP | V NP PP
  V -> "saw" | "ate"
  NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
  Det -> "a" | "an" | "the" | "my"
  N -> "dog" | "cat" | "cookie" | "park"
  PP -> P NP
  P -> "in" | "on" | "by" | "with"
  """)

CFG Parser

Recursive Descent Parser

cfgparser.py: CFG parser example continued

sent = "Mary saw Bob".split()
rd_parser = nltk.RecursiveDescentParser(grammar)
for p in rd_parser.nbest_parse(sent):
    print p
    p.draw()

sent = "John ate my cookie in the park".split()
rd_parser = nltk.RecursiveDescentParser(grammar)
for p in rd_parser.nbest_parse(sent):
    print p
    p.draw()
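A recursive descent parser tries the productions for each nonterminal in order and backtracks on failure. A minimal generator-based sketch over a cut-down version of the grammar (a toy, not NLTK's RecursiveDescentParser):

```python
# grammar: nonterminal -> list of alternative right-hand sides
GRAMMAR = {
    'S':  [['NP', 'VP']],
    'VP': [['V', 'NP']],
    'V':  [['saw'], ['ate']],
    'NP': [['John'], ['Mary'], ['Bob']],
}

def parse(symbol, tokens, pos):
    # yield (tree, next_pos) for every way symbol derives tokens[pos:]
    if symbol not in GRAMMAR:  # terminal symbol
        if pos < len(tokens) and tokens[pos] == symbol:
            yield symbol, pos + 1
        return
    for rhs in GRAMMAR[symbol]:  # try each production in order
        for children, end in parse_seq(rhs, tokens, pos):
            yield (symbol, children), end

def parse_seq(symbols, tokens, pos):
    # derive a sequence of symbols, threading the position through
    if not symbols:
        yield [], pos
        return
    for tree, mid in parse(symbols[0], tokens, pos):
        for rest, end in parse_seq(symbols[1:], tokens, mid):
            yield [tree] + rest, end

sent = "Mary saw Bob".split()
trees = [t for t, end in parse('S', sent, 0) if end == len(sent)]
print(trees[0])
# -> ('S', [('NP', ['Mary']), ('VP', [('V', ['saw']), ('NP', ['Bob'])])])
```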

Earley Parsing with Feature Structures

- Augment a CFG with feature structures (e.g. to express agreement on morphological properties)
- NLTK comes with a very simple implementation of feature structures
- Example: Earley Algorithm

earley.py: Feature-Based Earley Parsing

import nltk
tokens = 'Kim likes children'.split()
from nltk.parse import load_parser
cp = load_parser('grammars/feat0.fcfg', trace=2)
trees = cp.nbest_parse(tokens)

Classification and Clustering

- See the NLTK book...
- Additional support via numpy (optional numeric Python library)


TK GUI · WordNet browser · TK Widgets · Message Boxes · FileOpen Dialog · UNO · Zope and Plone · Google App Engine · Some wise thoughts

Part VII ... and other cool stuff: TK / Tkinter

TK / Tkinter

- Easily and quickly create GUI dialogs with Python
- Support users in entering data with a minimal amount of typing
- Platform-independent
- An additional package may be required (on Debian/Ubuntu: python-tk)
- Doc: Fredrik Lundh, "An Introduction to Tkinter"
- There are more ways to do GUI programming in Python, e.g. the toolkits used in configuration dialogs in Linux (KDE, Gnome: GTK)


My first TK GUI

tk1.py: TK GUI with Label and Button

import sys
from Tkinter import *

def end():
    sys.exit(0)

mywindow = Tk()
w = Label(mywindow, text="Hello, world!")
b = Button(mywindow, text="End", command=end)
w.pack()
b.pack()
mywindow.mainloop()

Text Entry and Listbox Example

wordnet.py: WordNet Polysemy Synset Inspector

from Tkinter import *
from nltk import wordnet

def showWordNetPolysemy():
    liBox.delete(0, END)
    input = eText.get()
    poly = wordnet.synsets(input, wordnet.NOUN)
    for synset in poly:
        liBox.insert(END, synset)
    liBox.pack()

mywindow = Tk()
eText = Entry(mywindow, width=80)  # Text box
eText.pack()
bGo = Button(mywindow, text="Show WordNet Polysemy",
             command=showWordNetPolysemy)
bGo.pack()
liBox = Listbox(mywindow, height=0, width=80)
liBox.pack()
mywindow.mainloop()


Further simple GUI elements (Widgets)

- Seen: labels, buttons, text box, list box
- Further elements: slider, checkbox, radio button, bitmap
- Primitive drawings: points, lines, circles, rectangles

Predefined Message Boxes

tk2.py: Predefined Message Boxes

from Tkinter import *
import tkMessageBox

mywindow = Tk()
answer = tkMessageBox.askyesno("Yes or No?", "Please choose: Yes or No?")
print answer

FileOpen Dialog

tk3.py: FileOpen Dialog

from Tkinter import *
import tkFileDialog

fo_window = tkFileDialog.Open()
filename = fo_window.show()
if filename:
    print filename
else:
    print "no file chosen"

Python Power for OpenOffice

Python-UNO bridge

(Remote-)controlling OpenOffice through Python

- UNO stands for Universal Network Objects
- Component object model of OpenOffice
- Idea: "Interoperability between programming languages, object models and hardware architectures, either in process or over process boundaries, as well as in the intranet or the Internet."
- UNO components may be implemented in and accessed from any programming language for which a UNO implementation (AKA language binding) and an appropriate bridge or adapter exists
- OpenOffice ships with a built-in Python 2.3.4 interpreter


Python UNO Example

Python code (macro) to create a new OpenOffice text (Writer) document and insert text and a table. Sample code from /usr/lib/openoffice/share/Scripts/python/pythonSamples/TableSample.py

- Required packages (in Debian/Ubuntu): openoffice.org, python-uno
- The example runs from within the OO macro menu
- Alternatively, OO can be started as a socket server, and an external Python script controls OO (even remotely)
- Further documentation: http://udk.openoffice.org/python/python-bridge.html

Zope and Plone

- Zope: Application Server implemented in Python
- Dynamic HTML, Object Model, Database, Python Scripting: http://www.zope.org
- Plone: CMS based on Zope: http://www.plone.org
- Example: Language Technology World Portal (http://www.lt-world.org)

By the way...

... need a job?
- DFKI project TAKE: Science Information Systems
- Innovative applications using NLP, information & relation extraction on scientific papers in our own field
- Annotation jobs: Coreference/Anaphora
- Programming jobs: Python and/or Java: GUI, NLP
- Interested? Send me your CV

... need credit points?
- Wintersemester 2008/09 Projektseminar / M.Sc. LS&T: Specialization course, area LT
- NLP+ML for Science Information Systems
- Various tasks: evaluation of an existing system, implementation work, e.g. automatic extension of an ontology (NLTK!)


Google App Engine

- Google is a Python company
- You may use Google servers to run your Python (only) applications through the Google App Engine: http://code.google.com/appengine/
- Would Google be as successful as it is if it were based on Java or Perl as much as on Python?
- Learned from Google: do business faster with Python (not only computationally)
- Extremely important for a company growing as quickly as Google


Some wise thoughts

- Coming back to the Zen from the first lecture...
- Explicit or implicit? (remember 'Hello' * 100 or list comprehensions...)
- Explicitness, maintainability: Java > Python > Perl
- Dynamic typing in Python is both an advantage and a disadvantage

Some wise thoughts

- Use the language that best solves your problem! Don't be dogmatic
- Python is a universal programming language that can help solve many problems very quickly
- Python is very good for learning how to program
- Python scales up to world-size problems
- But it may not be the perfect programming language in every case


Exercises

Part VIII Exercises

Exercises

Tomorrow:
1 Look at today's example code and experiment with it
2 Get the exercise sheet (http://www.dfki.de/~uschaefer/python09/) and try to solve the exercises


Thank you for your attention!
