
Introduction to Python Part 4: NLTK and other cool Python stuff




Alastair Burt, Andreas Eisele, Christian Federmann, Torsten Marek, Ulrich Schäfer
DFKI & Universität des Saarlandes

October 8th, 2009

Outline

Today's Topics:

1 NLTK, the Natural Language Toolkit
  - Overview
  - Tokenization and PoS Tagging
  - Morphology and Feature Structures
  - Accessing Corpora
  - Chunking and CFG Parsing
  - Classification and Clustering

2 Other cool Python stuff
  - Building GUIs with TK (Tkinter)
  - UNO: (Python-)programming OpenOffice
  - Zope & Plone
  - Google App Engine

3 Summary and some thoughts


What is NLTK? · Installation · NLP Pipeline

Part I NLTK Overview

What is NLTK?

NLTK Natural Language Toolkit


- Developed by Steven Bird, Ewan Klein and Edward Loper
- Mainly addresses education and research
- Very good documentation, book online: http://www.nltk.org/book
- NLTK examples on slides taken from there
- We only present very simple approaches to NLP here, often based on regular expressions. More sophisticated approaches are discussed in the book.

Installation

Required packages:

1 Python version 2.4, 2.5 or 2.6
2 NLTK source distribution: http://nltk.googlecode.com/files/nltk-2.0b6.zip
3 or NLTK Debian package: http://nltk.googlecode.com/files/nltk_2.0b5-1_all.deb
4 NLTK data (corpora etc.): see http://www.nltk.org/data
5 further packages and installation instructions: http://www.nltk.org/download

The NLP Analysis Pipeline

From Text Strings to Text Understanding:

1 Tokenization
2 Morphological analysis
3 Part-of-Speech tagging
4 Named entity recognition
5 Chunking (chunk parsing)
6 Sentence boundary detection
7 Syntactic parsing
8 Anaphora resolution
9 Semantic analysis
10 Pragmatics, Reasoning, ...


Tokenization and PoS Tagging · Debugging Regular Expressions · Regular Expression Tokenizer II · Regular Expression Tagger

Part III Tokenization and PoS Tagging

Regular Expression Tokenizer

Split the input string into a list of words (remember re.findall()).

regexptokenizer1.py: Simple tokenizer

import nltk
text = "Hello. Isn't this fun?"
pattern = r"\w+|[^\w\s]+"
print nltk.tokenize.regexp_tokenize(text, pattern)

Result:
['Hello', '.', 'Isn', "'", 't', 'this', 'fun', '?']
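If NLTK is not at hand, the same behaviour can be sketched with the standard library alone, since nltk.tokenize.regexp_tokenize is essentially re.findall applied to the text:

```python
import re

def regexp_tokenize(text, pattern):
    # stand-in for nltk.tokenize.regexp_tokenize:
    # return all non-overlapping matches of pattern, left to right
    return re.findall(pattern, text)

print(regexp_tokenize("Hello. Isn't this fun?", r"\w+|[^\w\s]+"))
# -> ['Hello', '.', 'Isn', "'", 't', 'this', 'fun', '?']
```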

Debugging Regular Expressions

NLTK tool for debugging regular expressions:

import nltk
nltk.re_show(pattern, inputstring)

Example:

import nltk
nltk.re_show('o+', 'Computational linguistics is cool.')

Result:
C{o}mputati{o}nal linguistics is c{oo}l.
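The marking behaviour of nltk.re_show can be imitated in one line with re.sub; this sketch also returns the marked string so it can be inspected:

```python
import re

def re_show(pattern, string):
    # wrap every match of pattern in {...}, like nltk.re_show
    marked = re.sub(pattern, lambda m: "{%s}" % m.group(), string)
    print(marked)
    return marked

re_show('o+', 'Computational linguistics is cool.')
# prints: C{o}mputati{o}nal linguistics is c{oo}l.
```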

Regular Expression Tokenizer II

Tokenize currency amounts and abbreviations correctly.

regexptokenizer2.py: Extended simple tokenizer

import nltk
text = 'That poster costs $22.40.'
pattern = r'''(?x)      # verbose regexps: whitespace and comments allowed
      \w+               # sequences of word characters
    | \$?\d+(\.\d+)?    # currency amounts, e.g. $12.50
    | ([A-Z]\.)+        # abbreviations, e.g. U.S.A.
    | [^\w\s]+          # sequences of punctuation
'''
nltk.tokenize.regexp_tokenize(text, pattern)

Result:
['That', 'poster', 'costs', '$22.40', '.']
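A plain re.findall version of this tokenizer needs two adjustments: the capturing groups must become non-capturing (?:...), because findall otherwise returns the groups rather than the matches, and the currency and abbreviation alternatives must precede \w+ so that e.g. U.S.A. is not split after its first letter. A sketch:

```python
import re

pattern = r'''(?x)          # verbose mode: whitespace and comments ignored
      \$?\d+(?:\.\d+)?      # currency amounts, e.g. $12.50
    | (?:[A-Z]\.)+          # abbreviations, e.g. U.S.A.
    | \w+                   # sequences of word characters
    | [^\w\s]+              # sequences of punctuation
'''
print(re.findall(pattern, 'That poster costs $22.40.'))
# -> ['That', 'poster', 'costs', '$22.40', '.']
```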

Regular Expression Tagger

Part-of-Speech tagging assigns a word class to each input token.

regexptagger.py: Simple Tagger for English

import nltk

patterns = [
    (r'.*ing$', 'VBG'),                # gerunds
    (r'.*ed$', 'VBD'),                 # simple past
    (r'.*es$', 'VBZ'),                 # 3rd singular present
    (r'.*ould$', 'MD'),                # modals
    (r".*'s$", 'NN$'),                 # possessive nouns
    (r'.*s$', 'NNS'),                  # plural nouns
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
    (r'.*', 'NN')                      # nouns (default)
]
regexp_tagger = nltk.RegexpTagger(patterns)
regexp_tagger.tag(nltk.corpus.brown.sents(categories='adventure')[3])

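The fallback logic of RegexpTagger (the first matching pattern wins, with a catch-all at the end) can be sketched in pure Python; this is a toy stand-in, not NLTK's implementation:

```python
import re

patterns = [
    (r'.*ing$', 'VBG'),  # gerunds
    (r'.*ed$', 'VBD'),   # simple past
    (r'.*es$', 'VBZ'),   # 3rd singular present
    (r'.*ould$', 'MD'),  # modals
    (r'.*s$', 'NNS'),    # plural nouns
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
    (r'.*', 'NN'),       # nouns (default)
]

def tag(tokens):
    # pair each token with the tag of the first pattern it matches
    return [(tok, next(t for p, t in patterns if re.match(p, tok)))
            for tok in tokens]

print(tag(['She', 'was', 'walking', 'home']))
# -> [('She', 'NN'), ('was', 'NNS'), ('walking', 'VBG'), ('home', 'NN')]
```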

Porter Stemmer

Part IV Morphology/Stemmer

Morphology/Stemmer

porter.py: Simple Stemmer (Porter)

import nltk
stemmer = nltk.PorterStemmer()
verbs = ['appears', 'appear', 'appeared', 'calling', 'called']
stems = []
for verb in verbs:
    stemmed_verb = stemmer.stem(verb)
    stems.append(stemmed_verb)
sorted(set(stems))

Result:
['appear', 'call']
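Porter's algorithm applies several ordered phases of rewrite rules; the basic suffix-stripping idea behind it can be sketched in a few lines (a toy, not the real PorterStemmer):

```python
def toy_stem(word):
    # strip one common suffix, longest first, keeping a stem of >= 3 letters
    for suffix in ('ing', 'ed', 'es', 's'):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

verbs = ['appears', 'appear', 'appeared', 'calling', 'called']
print(sorted(set(toy_stem(v) for v in verbs)))
# -> ['appear', 'call']
```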


Corpora included · Gutenberg Statistics · UDHR word length distribution · Concordance

Part V Accessing Corpora

Corpora included in NLTK

- Gutenberg Texts (subset)
- Brown Corpus
- UDHR
- CMU Pronunciation Dictionary
- TIMIT
- Senseval 2
- ... and many more

Example:

import nltk
nltk.corpus.brown.words()
nltk.corpus.gutenberg.fileids()


Gutenberg Corpus Statistics

The code displays corpus filename, average word length, average sentence length, and the number of times each vocabulary item appears in the text on average.

gutenberg.py: Compute simple corpus statistics

from nltk.corpus import gutenberg
for filename in gutenberg.fileids():
    r = gutenberg.raw(filename)
    w = gutenberg.words(filename)
    s = gutenberg.sents(filename)
    v = set(w)
    print filename, len(r)/len(w), len(w)/len(s), len(w)/len(v)

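The three ratios can be computed for any plain string without NLTK; a sketch with crude whitespace tokenization and a hypothetical two-sentence text (note that the slide's Python 2 integer divisions truncate, while true division as below gives exact averages):

```python
def corpus_stats(raw):
    # average word length, average sentence length, lexical diversity
    words = raw.split()                               # crude tokenization
    sents = [s for s in raw.split('.') if s.strip()]  # crude sentence split
    vocab = set(words)
    return (len(raw) / len(words),    # chars per word (incl. whitespace)
            len(words) / len(sents),  # words per sentence
            len(words) / len(vocab))  # tokens per vocabulary item

stats = corpus_stats("the cat sat. the dog sat.")
print(stats[1])  # -> 3.0 words per sentence
```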

UDHR corpus

Contains the Universal Declaration of Human Rights in over 300 languages. The example requires matplotlib (see the NLTK download page on how to get it).

udhr.py: Compute and display word length distribution

import nltk, pylab

def cld(lang):
    text = nltk.corpus.udhr.words(lang)
    fd = nltk.FreqDist(len(token) for token in text)
    ld = [100*fd.freq(i) for i in range(36)]
    return [sum(ld[0:i+1]) for i in range(len(ld))]

langs = ['Chickasaw-Latin1', 'English-Latin1', 'German_Deutsch-Latin1',
         'Greenlandic_Inuktikut-Latin1', 'Hungarian_Magyar-Latin1',
         'Ibibio_Efik-Latin1']
dists = [pylab.plot(cld(l), label=l[:-7], linewidth=2) for l in langs]
pylab.title('Cumulative Word Length Distrib. for Several Languages')
pylab.legend(loc='lower right')
pylab.show()
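nltk.FreqDist is in essence a collections.Counter, so the cumulative length distribution can be sketched with the standard library on a toy word list standing in for the UDHR corpus:

```python
from collections import Counter

def cumulative_length_dist(words, max_len=10):
    # percentage of tokens of length <= i, for i = 0 .. max_len-1
    fd = Counter(len(w) for w in words)
    total = sum(fd.values())
    percents = [100 * fd[i] / total for i in range(max_len)]
    return [sum(percents[:i + 1]) for i in range(max_len)]

dist = cumulative_length_dist(['all', 'human', 'beings', 'are', 'born', 'free'])
print(round(dist[-1]))  # -> 100 (every token is shorter than max_len)
```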


Concordance

concordance.py: Concordance on Brown Corpus

import nltk

def concordance(word, context):
    "Generate a concordance for the word with the specified context window"
    for sent in nltk.corpus.brown.sents(categories='a'):
        try:
            pos = sent.index(word)
            left = ' '.join(sent[:pos])
            right = ' '.join(sent[pos+1:])
            print '%*s %s %-*s' % \
                (context, left[-context:], word, context, right[:context])
        except ValueError:
            pass

concordance('line', 32)
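The alignment relies on the %*s format specifier, which reads the field width from the argument list. A corpus-free sketch of the same function over a list of pre-tokenized sentences:

```python
def concordance(word, context, sents):
    # return one aligned line per sentence that contains word
    lines = []
    for sent in sents:
        try:
            pos = sent.index(word)
        except ValueError:
            continue  # word not in this sentence
        left = ' '.join(sent[:pos])
        right = ' '.join(sent[pos + 1:])
        lines.append('%*s %s %-*s' % (context, left[-context:],
                                      word, context, right[:context]))
    return lines

sents = [['the', 'line', 'was', 'long'], ['no', 'match', 'here']]
print(concordance('line', 10, sents))
# -> ['       the line was long  ']
```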


Chunker · CFG Parser · Earley Parser · Classification and Clustering

Part VI Chunking and CFG Parsing

Chunker

chunkparser.py: Chunk parser

import nltk

grammar = r"""
NP: {<DT|PP\$>?<JJ>*<NN>}  # chunk determiner/possessive,
                           # adjectives and nouns
    {<NNP>+}               # chunk sequences of proper nouns
"""
cp = nltk.RegexpParser(grammar)
tagged_tokens = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
                 ("her", "PP$"), ("golden", "JJ"), ("hair", "NN")]
print cp.parse(tagged_tokens)
cp.parse(tagged_tokens).draw()
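Under the hood, a regexp chunker matches patterns over the sequence of tags rather than the words. The idea can be sketched with re by encoding the tag sequence as a string (a toy NP chunker, not NLTK's RegexpParser):

```python
import re

def chunk_np(tagged_tokens):
    # encode the tag sequence as '<TAG><TAG>...' and find NP spans
    tags = ''.join('<%s>' % t for _, t in tagged_tokens)
    spans = []
    for m in re.finditer(r'(?:<DT>|<PP\$>)?(?:<JJ>)*<NN>|(?:<NNP>)+', tags):
        # convert character offsets back to token indices
        start = tags[:m.start()].count('<')
        end = start + m.group().count('<')
        spans.append([w for w, _ in tagged_tokens[start:end]])
    return spans

tagged = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
          ("her", "PP$"), ("golden", "JJ"), ("hair", "NN")]
print(chunk_np(tagged))
# -> [['Rapunzel'], ['her', 'golden', 'hair']]
```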

CFG Parser

Various parsing algorithms are implemented within NLTK and explained in detail in the NLTK book.

cfgparser.py: CFG parser

import nltk
grammar = nltk.parse_cfg("""
  S -> NP VP
  VP -> V NP | V NP PP
  V -> "saw" | "ate"
  NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
  Det -> "a" | "an" | "the" | "my"
  N -> "dog" | "cat" | "cookie" | "park"
  PP -> P NP
  P -> "in" | "on" | "by" | "with"
  """)

CFG Parser

Recursive Descent Parser

cfgparser.py: CFG parser example continued

sent = "Mary saw Bob".split()
rd_parser = nltk.RecursiveDescentParser(grammar)
for p in rd_parser.nbest_parse(sent):
    print p
    p.draw()

sent = "John ate my cookie in the park".split()
rd_parser = nltk.RecursiveDescentParser(grammar)
for p in rd_parser.nbest_parse(sent):
    print p
    p.draw()
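A recursive descent parser tries the productions for each nonterminal in order and backtracks on failure. A minimal generator-based sketch over a cut-down version of the grammar (a toy, not NLTK's RecursiveDescentParser):

```python
# grammar: nonterminal -> list of alternative right-hand sides
GRAMMAR = {
    'S':  [['NP', 'VP']],
    'VP': [['V', 'NP']],
    'V':  [['saw'], ['ate']],
    'NP': [['John'], ['Mary'], ['Bob']],
}

def parse(symbol, tokens, pos):
    # yield (tree, next_pos) for every way symbol derives tokens[pos:]
    if symbol not in GRAMMAR:  # terminal symbol
        if pos < len(tokens) and tokens[pos] == symbol:
            yield symbol, pos + 1
        return
    for rhs in GRAMMAR[symbol]:  # try each production in order
        for children, end in parse_seq(rhs, tokens, pos):
            yield (symbol, children), end

def parse_seq(symbols, tokens, pos):
    # derive a sequence of symbols, threading the position through
    if not symbols:
        yield [], pos
        return
    for tree, mid in parse(symbols[0], tokens, pos):
        for rest, end in parse_seq(symbols[1:], tokens, mid):
            yield [tree] + rest, end

sent = "Mary saw Bob".split()
trees = [t for t, end in parse('S', sent, 0) if end == len(sent)]
print(trees[0])
# -> ('S', [('NP', ['Mary']), ('VP', [('V', ['saw']), ('NP', ['Bob'])])])
```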

Earley Parsing with Feature Structures

- Augment a CFG with feature structures (e.g. to express agreement on morphological properties)
- NLTK comes with a very simple implementation of feature structures
- Example: Earley Algorithm

earley.py: Feature-Based Earley Parsing

import nltk
tokens = 'Kim likes children'.split()
from nltk.parse import load_parser
cp = load_parser('grammars/feat0.fcfg', trace=2)
trees = cp.nbest_parse(tokens)

Classification and Clustering

- See the NLTK book...
- Additional support via numpy (optional numeric Python library)


TK GUI · WordNet browser · TK Widgets · Message Boxes · FileOpen Dialog · UNO · Zope and Plone · Google App Engine · Some wise thoughts

Part VII ... and other cool stuff: TK / Tkinter

TK / Tkinter

- Easily and quickly create GUI dialogs with Python
- Support users in entering data with a minimal amount of typing
- Platform-independent
- An additional package may be required (on Debian/Ubuntu: python-tk)
- Doc: Fredrik Lundh, "An Introduction to Tkinter"
- There are more ways to do GUI programming in Python, e.g. the toolkits used in configuration dialogs in Linux (KDE, Gnome: GTK)


My first TK GUI

tk1.py: TK GUI with Label and Button

import sys
from Tkinter import *

def end():
    sys.exit(0)

mywindow = Tk()
w = Label(mywindow, text="Hello, world!")
b = Button(mywindow, text="End", command=end)
w.pack()
b.pack()
mywindow.mainloop()

Text Entry and Listbox Example

wordnet.py: WordNet Polysemy Synset Inspector

from Tkinter import *
from nltk import wordnet

def showWordNetPolysemy():
    liBox.delete(0, END)
    input = eText.get()
    poly = wordnet.synsets(input, wordnet.NOUN)
    for synset in poly:
        liBox.insert(END, synset)
    liBox.pack()

mywindow = Tk()
eText = Entry(mywindow, width=80)  # Text box
eText.pack()
bGo = Button(mywindow, text="Show WordNet Polysemy",
             command=showWordNetPolysemy)
bGo.pack()
liBox = Listbox(mywindow, height=0, width=80)
liBox.pack()
mywindow.mainloop()


Further simple GUI elements (Widgets)

- Seen: labels, buttons, text box, list box
- Further elements: slider, checkbox, radio button, bitmap
- Primitive drawings: points, lines, circles, rectangles

Predefined Message Boxes

tk2.py: Predefined Message Boxes

from Tkinter import *
import tkMessageBox

mywindow = Tk()
answer = tkMessageBox.askyesno("Yes or No?", "Please choose: Yes or No?")
print answer

FileOpen Dialog

tk3.py: FileOpen Dialog

from Tkinter import *
import tkFileDialog

fo_window = tkFileDialog.Open()
filename = fo_window.show()
if filename:
    print filename
else:
    print "no file chosen"

Python Power for OpenOffice

Python-UNO bridge

(Remote-)controlling OpenOffice through Python

- UNO stands for Universal Network Objects
- Component object model of OpenOffice
- Idea: "Interoperability between programming languages, object models and hardware architectures, either in process or over process boundaries, as well as in the intranet or the Internet."
- UNO components may be implemented in and accessed from any programming language for which a UNO implementation (AKA language binding) and an appropriate bridge or adapter exists
- OpenOffice ships with a built-in Python 2.3.4 interpreter


Python UNO Example

Python code (macro) to create a new OpenOffice text (Writer) document and insert text and a table. Sample code from /usr/lib/openoffice/share/Scripts/python/pythonSamples/TableSample.py

- Required packages (in Debian/Ubuntu): openoffice.org, python-uno
- The example runs from within the OO macro menu
- Alternatively, OO can be started as a socket server, and an external Python script controls OO (even remotely)
- Further documentation: http://udk.openoffice.org/python/python-bridge.html

Zope and Plone

- Zope: Application Server implemented in Python
- Dynamic HTML, Object Model, Database, Python Scripting: http://www.zope.org
- Plone: CMS based on Zope: http://www.plone.org
- Example: Language Technology World Portal (http://www.lt-world.org)

By the way...

... need a job?
- DFKI project TAKE: Science Information Systems
- Innovative applications using NLP, information & relation extraction on scientific papers in our own field
- Annotation jobs: Coreference/Anaphora
- Programming jobs: Python and/or Java: GUI, NLP
- Interested? Send me your CV

... need credit points?
- Wintersemester 2008/09 Projektseminar / M.Sc. LS&T: Specialization course, area LT
- NLP+ML for Science Information Systems
- Various tasks: evaluation of an existing system, implementation work, e.g. automatic extension of an ontology (NLTK!)


Google App Engine

- Google is a Python company
- You may use Google servers to run your Python (only) applications through the Google App Engine: http://code.google.com/appengine/
- Would Google be as successful as it is if it were based on Java or Perl as much as on Python?
- Learned from Google: do business faster with Python (not only computationally)
- Extremely important for a company growing as quickly as Google


Some wise thoughts

- Coming back to the Zen from the first lecture...
- Explicit or implicit? (remember 'Hello' * 100 or list comprehensions...)
- Explicitness, maintainability: Java > Python > Perl
- Dynamic typing in Python is both an advantage and a disadvantage

Some wise thoughts

- Use the language that best solves your problem! Don't be dogmatic
- Python is a universal programming language that can help solve many problems very quickly
- Python is very good for learning how to program
- Python scales up to world-size problems
- But it may not be the perfect programming language in every case


Exercises

Part VIII Exercises

Exercises

Tomorrow:
1 Look at today's example code and experiment with it
2 Get the exercise sheet (http://www.dfki.de/~uschaefer/python09/) and try to solve the exercises


Thank you for your attention!
