Introduction to Python Part 4: NLTK and other cool Python stuff
Outline
What is NLTK?
Developed by Steven Bird, Ewan Klein and Edward Loper
Mainly addresses education and research
Very good documentation, book online: http://www.nltk.org/book
NLTK examples on slides taken from there
We only present very simple approaches to NLP here, often based on regular expressions. More sophisticated approaches are discussed in the book.
Installation
Required packages
1 http://nltk.googlecode.com/files/nltk_2.0b5-1_all.deb
4 NLTK data (corpora etc.): see http://www.nltk.org/data
5 Further packages and installation instructions: http://www.nltk.org/download
Tokenization
Morphological analysis
Part-of-Speech tagging
Named entity recognition
Chunking (chunk parsing)
Sentence boundary detection
Syntactic parsing
Anaphora resolution
Semantic analysis
Pragmatics, Reasoning, ...
Tokenization and PoS Tagging Debugging Regular Expressions Regular Expression Tokenizer II Regular Expression Tagger
Split the input string into a list of words (remember re.findall())
regexptokenizer1.py: Simple tokenizer
import nltk
text = "Hello. Isn't this fun?"
pattern = r'\w+|[^\w\s]+'
print nltk.tokenize.regexp_tokenize(text, pattern)
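Since the slide points back to re.findall(): for a pattern without capturing groups, the same token list can be produced with the standard library alone. A minimal sketch, assuming Python 3; the function name simple_tokenize is made up for this example and is not part of NLTK.

```python
import re

def simple_tokenize(text, pattern=r"\w+|[^\w\s]+"):
    """Return all non-overlapping matches of pattern, scanning left to right."""
    return re.findall(pattern, text)

tokens = simple_tokenize("Hello. Isn't this fun?")
print(tokens)
```

Note that the apostrophe becomes a token of its own, because it is neither a word character nor whitespace.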
Example
import nltk
nltk.re_show('o+', 'Computational linguistics is cool.')
Result:
C{o}mputati{o}nal linguistics is c{oo}l.
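The same kind of debugging output can be approximated without NLTK by wrapping every match in braces with re.sub(). A small sketch, assuming Python 3; show_matches is a hypothetical helper that returns the marked-up string instead of printing it.

```python
import re

def show_matches(pattern, text):
    """Return text with every match of pattern wrapped in curly braces."""
    return re.sub(pattern, lambda m: "{" + m.group(0) + "}", text)

marked = show_matches(r"o+", "Computational linguistics is cool.")
print(marked)
```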
Tokenize currency amounts and abbreviations correctly
regexptokenizer2.py: Extended simple tokenizer
import nltk
text = 'That poster costs $22.40.'
# specific alternatives come first: \w+ would otherwise match "U" or "22" before they are tried
pattern = r'''(?x)
    \$?\d+(?:\.\d+)?    # currency amounts, e.g. $12.50
  | (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
  | \w+                 # sequences of word characters
  | [^\w\s]+            # sequences of punctuation
'''
nltk.tokenize.regexp_tokenize(text, pattern)
Result:
['That', 'poster', 'costs', '$22.40', '.']
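Regular expression alternation tries the alternatives from left to right, so the order of the sub-patterns decides how abbreviations and numbers are split. The sketch below compares two orderings with plain re.findall(); it assumes Python 3, uses non-capturing groups (?:...) because findall() returns group contents when a pattern contains capturing groups, and the sample sentence is invented for the example.

```python
import re

# \w+ listed first: it grabs "U" and "22" before the later alternatives are tried.
WORDS_FIRST = r"""(?x)
    \w+                    # sequences of word characters
  | \$?\d+(?:\.\d+)?       # currency amounts, e.g. $12.50
  | (?:[A-Z]\.)+           # abbreviations, e.g. U.S.A.
  | [^\w\s]+               # sequences of punctuation
"""

# Specific alternatives first, generic \w+ afterwards.
SPECIFIC_FIRST = r"""(?x)
    \$?\d+(?:\.\d+)?       # currency amounts, e.g. $12.50
  | (?:[A-Z]\.)+           # abbreviations, e.g. U.S.A.
  | \w+                    # sequences of word characters
  | [^\w\s]+               # sequences of punctuation
"""

sample = "The U.S.A. poster costs 22.40."
split_tokens = re.findall(WORDS_FIRST, sample)
kept_tokens = re.findall(SPECIFIC_FIRST, sample)
print(split_tokens)
print(kept_tokens)
```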
Part-of-Speech tagging assigns a word class to each input token
regexptagger.py: Simple Tagger for English
import nltk
patterns = [
    (r'.*ing$', 'VBG'),                 # gerunds
    (r'.*ed$', 'VBD'),                  # simple past
    (r'.*es$', 'VBZ'),                  # 3rd singular present
    (r'.*ould$', 'MD'),                 # modals
    (r".*'s$", 'NN$'),                  # possessive nouns
    (r'.*s$', 'NNS'),                   # plural nouns
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),   # cardinal numbers
    (r'.*', 'NN')                       # nouns (default)
]
regexp_tagger = nltk.RegexpTagger(patterns)
# tag() expects one tokenized sentence; the index 3 is an arbitrary choice
regexp_tagger.tag(nltk.corpus.brown.sents(categories='adventure')[3])
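How such a tagger works can be sketched in a few lines: each token is tested against the patterns in order, and the first pattern that matches determines the tag. The sketch assumes Python 3 and the standard library only; regexp_tag is a hypothetical helper, not NLTK's RegexpTagger, and the example sentence is invented.

```python
import re

# (pattern, tag) pairs, tried in this order; the last one is the default.
PATTERNS = [
    (r".*ing$", "VBG"),                    # gerunds
    (r".*ed$", "VBD"),                     # simple past
    (r".*es$", "VBZ"),                     # 3rd singular present
    (r".*ould$", "MD"),                    # modals
    (r".*'s$", "NN$"),                     # possessive nouns
    (r".*s$", "NNS"),                      # plural nouns
    (r"^-?[0-9]+(?:\.[0-9]+)?$", "CD"),    # cardinal numbers
    (r".*", "NN"),                         # nouns (default)
]

def regexp_tag(tokens, patterns=PATTERNS):
    """Tag each token with the tag of the first pattern that matches it."""
    compiled = [(re.compile(regex), tag) for regex, tag in patterns]
    tagged = []
    for token in tokens:
        for regex, tag in compiled:
            if regex.match(token):
                tagged.append((token, tag))
                break
    return tagged

tagged = regexp_tag(["John's", "dogs", "walked", "home", "singing", "42"])
print(tagged)
```

Because the default pattern matches everything, every token receives a tag; the price is that many non-nouns end up as NN.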
Part IV Morphology/Stemmer
Porter Stemmer
Result:
['appear', 'call']
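As a rough illustration of what a stemmer does, the sketch below strips a few common English suffixes. It is a deliberately naive stand-in, not the Porter algorithm; the function name naive_stem, the suffix list, the minimum stem length and the input words are all assumptions made for this example (Python 3, standard library only).

```python
SUFFIXES = ("ing", "ed", "es", "s")   # assumed suffix list, checked in this order
MIN_STEM_LENGTH = 3                   # assumed guard against over-stripping

def naive_stem(word):
    """Strip the first matching suffix if the remaining stem is long enough."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[:-len(suffix)]
    return word

stems = [naive_stem(w) for w in ["appeared", "calling"]]
print(stems)
```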
Corpora included in NLTK
Gutenberg Texts (subset)
Brown Corpus
UDHR
CMU Pronunciation Dictionary
TIMIT
Senseval 2
... and many more
Example
import nltk
nltk.corpus.brown.words()
nltk.corpus.gutenberg.fileids()
Code displays corpus filename, average word length, average sentence length, and the number of times each vocabulary item appears in the text on average.
gutenberg.py: Compute simple corpus statistics
from nltk.corpus import gutenberg
for filename in gutenberg.fileids():
    r = gutenberg.raw(filename)
    w = gutenberg.words(filename)
    s = gutenberg.sents(filename)
    v = set(w)
    print filename, len(r)/len(w), len(w)/len(s), len(w)/len(v)
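The three ratios can be tried out on any string without downloading a corpus. A minimal sketch, assuming Python 3 and the standard library; corpus_stats is a hypothetical helper, and the regex-based word and sentence splitting is a crude substitute for the segmentation that the NLTK corpus readers provide.

```python
import re

def corpus_stats(raw):
    """Return (chars per word, words per sentence, words per vocabulary item)."""
    words = re.findall(r"\w+|[^\w\s]+", raw)                  # words and punctuation
    sents = [s for s in re.split(r"(?<=[.!?])\s+", raw.strip()) if s]
    vocab = set(words)
    # As on the slide, the character count includes whitespace.
    return (len(raw) / len(words), len(words) / len(sents), len(words) / len(vocab))

avg_word_len, avg_sent_len, avg_item_freq = corpus_stats("The cat sat. The cat ran away.")
print(avg_word_len, avg_sent_len, avg_item_freq)
```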
UDHR corpus
Contains the Universal Declaration of Human Rights in over 300 languages.
Example requires matplotlib (see NLTK download page on how to get it).
udhr.py: Compute and display word length distribution
import nltk, pylab

def cld(lang):
    text = nltk.corpus.udhr.words(lang)
    fd = nltk.FreqDist(len(token) for token in text)
    ld = [100*fd.freq(i) for i in range(36)]
    return [sum(ld[0:i+1]) for i in range(len(ld))]

langs = ['Chickasaw-Latin1', 'English-Latin1', 'German_Deutsch-Latin1',
         'Greenlandic_Inuktikut-Latin1', 'Hungarian_Magyar-Latin1', 'Ibibio_Efik-Latin1']
dists = [pylab.plot(cld(l), label=l[:-7], linewidth=2) for l in langs]
pylab.title('Cumulative Word Length Distrib. for Several Languages')
pylab.legend(loc='lower right')
pylab.show()
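The computation inside cld() can be followed on a toy token list without NLTK or matplotlib. A sketch assuming Python 3; cumulative_length_dist, its max_len parameter and the sample tokens are made up for the example, with collections.Counter standing in for nltk.FreqDist.

```python
from collections import Counter
from itertools import accumulate

def cumulative_length_dist(tokens, max_len=35):
    """Return cumulative percentages of tokens with length 0..max_len.

    Expects a non-empty token list.
    """
    counts = Counter(len(token) for token in tokens)
    total = sum(counts.values())
    percentages = [100.0 * counts[i] / total for i in range(max_len + 1)]
    return list(accumulate(percentages))

dist = cumulative_length_dist(["a", "bb", "bb", "ccc"], max_len=4)
print(dist)
```

Entry i of the result is the percentage of tokens that are at most i characters long, which is what each curve in the plot shows.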
Concordance
Chunker
import nltk
grammar = r"""
  NP: {<DT|PP\$>?<JJ>*<NN>}   # chunk determiner/possessive, adjectives and nouns
      {<NNP>+}                # chunk sequences of proper nouns
"""
cp = nltk.RegexpParser(grammar)
tagged_tokens = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
                 ("her", "PP$"), ("golden", "JJ"), ("hair", "NN")]
print cp.parse(tagged_tokens)
cp.parse(tagged_tokens).draw()
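The idea behind tag patterns can be sketched with the standard library: write the tags of a sentence as one string such as <NNP><VBD>..., run an ordinary regular expression over it, and map the matches back to words. This is a simplified illustration assuming Python 3, not NLTK's implementation; np_chunk and NP_PATTERN are hypothetical names, and the two rules are merged into a single alternation rather than applied one after the other.

```python
import re

# Assumed pattern: the two NP rules, written over a string of <TAG> items.
NP_PATTERN = re.compile(r"(?:<DT>|<PP\$>)?(?:<JJ>)*<NN>|(?:<NNP>)+")

def np_chunk(tagged_tokens):
    """Return the noun-phrase chunks of a tagged sentence as lists of words."""
    tag_string = ""
    starts = []                      # offset at which each token's tag begins
    for _, tag in tagged_tokens:
        starts.append(len(tag_string))
        tag_string += "<%s>" % tag
    chunks = []
    for match in NP_PATTERN.finditer(tag_string):
        words = [word for (word, _), start in zip(tagged_tokens, starts)
                 if match.start() <= start < match.end()]
        chunks.append(words)
    return chunks

sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
            ("her", "PP$"), ("golden", "JJ"), ("hair", "NN")]
chunks = np_chunk(sentence)
print(chunks)
```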
CFG Parser
Various parsing algorithms are implemented within NLTK and explained in detail in the NLTK book
cfgparser.py: CFG parser
import nltk
grammar = nltk.parse_cfg("""
    S -> NP VP
    VP -> V NP | V NP PP
    V -> "saw" | "ate"
    NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
    Det -> "a" | "an" | "the" | "my"
    N -> "dog" | "cat" | "cookie" | "park"
    PP -> P NP
    P -> "in" | "on" | "by" | "with"
""")
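To see what parsing with this grammar involves, here is a brute-force top-down (recursive descent) search written with the standard library only. It is a sketch assuming Python 3, not one of NLTK's parser classes; GRAMMAR, parse, expand and all_parses are names invented for the example, and the approach only terminates because the grammar has no left-recursive rules.

```python
# The grammar as a dict: nonterminal -> list of right-hand sides.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "VP": [["V", "NP"], ["V", "NP", "PP"]],
    "V": [["saw"], ["ate"]],
    "NP": [["John"], ["Mary"], ["Bob"], ["Det", "N"], ["Det", "N", "PP"]],
    "Det": [["a"], ["an"], ["the"], ["my"]],
    "N": [["dog"], ["cat"], ["cookie"], ["park"]],
    "PP": [["P", "NP"]],
    "P": [["in"], ["on"], ["by"], ["with"]],
}

def parse(symbol, tokens, start):
    """Yield (tree, end) for every way symbol derives tokens[start:end]."""
    if symbol not in GRAMMAR:                       # terminal: must match the token
        if start < len(tokens) and tokens[start] == symbol:
            yield symbol, start + 1
        return
    for production in GRAMMAR[symbol]:
        for children, end in expand(production, tokens, start):
            yield (symbol, children), end

def expand(symbols, tokens, start):
    """Yield (list of subtrees, end) for a sequence of symbols."""
    if not symbols:
        yield [], start
        return
    for tree, middle in parse(symbols[0], tokens, start):
        for rest, end in expand(symbols[1:], tokens, middle):
            yield [tree] + rest, end

def all_parses(sentence):
    """Return every tree for S that covers the whole sentence."""
    tokens = sentence.split()
    return [tree for tree, end in parse("S", tokens, 0) if end == len(tokens)]

simple = all_parses("Mary saw Bob")
ambiguous = all_parses("the dog saw a cat in the park")
print(len(simple), len(ambiguous))
```

The second sentence gets two trees because the PP can attach either to the verb phrase or to the noun phrase "a cat".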
CFG Parser
Augment CFG with feature structures (e.g. to express agreement on morphological properties)
NLTK comes with a very simple implementation of feature structures
Example: Earley Algorithm
earley.py: Feature-Based Earley Parsing
import nltk
from nltk.parse import load_parser
tokens = 'Kim likes children'.split()
cp = load_parser('grammars/feat0.fcfg', trace=2)
trees = cp.nbest_parse(tokens)
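The agreement idea can be illustrated without a parser: two feature structures unify if they do not assign conflicting values to the same feature. A minimal sketch for flat (non-nested) feature structures represented as dicts, assuming Python 3; unify and the feature names NUM and PER are assumptions of this example, and NLTK's own feature structures are more general.

```python
def unify(fs1, fs2):
    """Merge two flat feature structures; return None if they conflict."""
    result = dict(fs1)
    for feature, value in fs2.items():
        if feature in result and result[feature] != value:
            return None                  # e.g. singular subject, plural verb
        result[feature] = value
    return result

agree = unify({"NUM": "sg"}, {"NUM": "sg", "PER": 3})
clash = unify({"NUM": "sg"}, {"NUM": "pl"})
print(agree, clash)
```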
See the NLTK book...
Additional support via numpy (optional numeric Python library)
TK GUI WordNet browser TK Widgets Message Boxes FileOpen Dialog UNO Zope and Plone Google App Engine Some wise thoughts
TK / TKinter
Easily and quickly create GUI dialogs with Python
Help users enter data with a minimal amount of typing
Platform-independent
Additional package may be required (on Debian/Ubuntu: python-tk)
Documentation: Fredrik Lundh, An Introduction to Tkinter
There are more ways to do GUI programming in Python ... e.g. used in configuration dialogs in Linux (KDE, Gnome: GTK)
My first TK GUI
from Tkinter import *
import sys

def end():
    sys.exit(0)

mywindow = Tk()
w = Label(mywindow, text="Hello, world!")
b = Button(mywindow, text="End", command=end)
w.pack()
b.pack()
mywindow.mainloop()
TK Widgets
Seen: labels, buttons, text box, list box
Further elements: slider, checkbox, radio button, bitmap
Primitive drawings: points, lines, circles, rectangles
from Tkinter import *
import tkMessageBox
mywindow = Tk()
answer = tkMessageBox.askyesno("Yes or No?", "Please choose: Yes or No?")
print answer
FileOpen Dialog
Python-UNO bridge
UNO stands for Universal Network Objects
Component object model of OpenOffice
Idea: interoperability between programming languages, object models and hardware architectures, either in process or over process boundaries, as well as in the intranet or the Internet
UNO components may be implemented in and accessed from any programming language for which a UNO implementation (AKA language binding) and an appropriate bridge or adapter exists
OpenOffice ships with a built-in Python 2.3.4 interpreter
Python UNO Example
Python code (macro) to create a new OpenOffice text (Writer) document, insert text and a table.
Sample code from /usr/lib/openoffice/share/Scripts/python/pythonSamples/TableSample.py
Required packages (in Debian/Ubuntu): openoffice.org, python-uno
Example runs from within the OO macro menu
Alternatively, OO can be started as a socket server, and an external Python script controls OO (even remotely)
Further documentation: http://udk.openoffice.org/python/python-bridge.html
Zope: Application Server implemented in Python
Dynamic HTML, Object Model, Database, Python Scripting
http://www.zope.org
Plone: CMS based on Zope
http://www.plone.org
Example: Language Technology World Portal (http://www.lt-world.org)
By the way...
... need a job?
DFKI project TAKE: Science Information Systems
Innovative applications using NLP, information & relation extraction on scientific papers in our own field
Annotation jobs: Coreference/Anaphora
Programming jobs: Python and/or Java: GUI, NLP
Interested? Send me your CV
... need credit points?
Winter semester 2008/09: project seminar / M.Sc. LS&T: specialization course, area LT
NLP+ML for Science Information Systems
Various tasks: evaluation of existing system, implementation work, e.g. automatic extension of ontology (NLTK!)
Google is a Python company
You may use Google servers to run your Python (only) applications through the Google App Engine
http://code.google.com/appengine/
Would Google be as successful as it is if it were based on Java or Perl as much as on Python?
Learned from Google: Do business faster with Python (not only computationally)
Extremely important for a company growing as quickly as Google
Coming back to the Zen from the first lecture...
Explicit or implicit? (remember 'Hello' * 100 or list comprehensions...)
Explicitness, maintainability: Java > Python > Perl
Dynamic typing in Python is both an advantage and a disadvantage
Use the language that best solves your problem!
Don't be dogmatic
Python is a universal programming language that can help to solve many problems very quickly
Python is very good for learning how to program
Python scales up to world-size problems
But it may not be the perfect programming language in every case
Exercises
Tomorrow
1 Look at today's example code, experiment with it
2 Get the exercise sheet (http://www.dfki.de/~uschaefer/python09/) and try to solve the exercises