
Zachary Bornheimer

zbornheimer@gmail.com | University of South Florida


Honors College - Supervised Research

Non-statistical Language-Blind
Morpheme (Candidate)* Extraction
An Unsupervised Machine Learning Approach

*One cannot know whether a particular grapheme sequence represents a morpheme until meaning
can be assigned to it. For brevity, I use the term morpheme throughout this paper, but I am
always referring to a morpheme candidate.



Submitted December 19, 2013
Revised March 5, 2014
Introduction/Abstract
During Fall 2013, I was given the opportunity to work on a project that would yield language-blind,
non-statistical morpheme extraction. Morpheme lists are the keys to many research projects in
linguistics; however, they have to be tailor-made for the language or experiment. I set out to
determine a way to create a morpheme list without knowing the language. While I was somewhat
successful in this approach, there were some flaws, namely false positives. The data delivered is
promising, but there are two hurdles: serial computing and meaning. Assigning meaning to
morphemes eliminates false positives, yet we are working without a gloss, so we have quite a few
false positives; with more data, however, this may not be an issue. Additionally, because of the
computing power required to run the code, it would be ideal to rewrite part of it to make use of
the parallel computing power of a GPU (which may yield speed improvements of several orders of
magnitude). Overall, this was a good first step in a project that has much grander requirements.

Design Decisions

Paradigms
The approach for this project was non-statistical from its origins, as the majority of work being
done in natural language processing is done via statistics. I didn't really understand this
approach; I came into the problem with the idea that humans are pattern-based creatures, and that
if languages are developed through patterns and rules, the deconstruction of a language can be
done with patterns and rules. In terms of implementation, I had completed a bit of the work in
Kernighan & Ritchie's book The C Programming Language (1998), or K&R as the industry calls it, by
August 2013, but I had reached the part of the book where it said to undertake a large task:

"It's possible to write useful programs of considerable size, and it would probably be a good idea
if you paused long enough to do so" (Kernighan & Ritchie, 1998, last paragraph of Ch. 1 before the
exercises).

While this task was quite substantial, the only subjects I needed to learn for it that were not
presented in Chapter 1 of K&R were memory management (pointers and references) and structs.

Language
I decided to use an unsupervised machine learning algorithm, which would more closely mimic the
human acquisition process. For this research, I undertook the program's development in the C
programming language for the following reasons:

1) I could control how memory was allocated and when it was freed
2) I could manipulate the memory itself
3) I could store memory addresses instead of duplicating content

When testing a simpler, more aggressive version of the system design in Perl6 (written about a
year earlier, in December 2012), I found that memory usage was an insurmountable difficulty. My
Perl6 program assumed the supplied corpora used a Latin alphabet; this program makes no such
assumption. Writing the program in C was something that needed to be done from an artificial
intelligence perspective, as C would allow me to control the speed and intensity of the program;
the Perl6 version was unable, with sufficient time or memory, to handle reading Moby Dick, The
Adventures of Huckleberry Finn, and/or War of the Worlds. The goal of the program was to develop
a morpheme list from these texts using the same basic algorithm while controlling the memory and
the assumptions of the design; this could only be accomplished with the language C.

Computer Science Paradigms
As memory management was a goal, I decided early on to use the tool valgrind to help me
eliminate memory leaks and to optimize performance. Additionally, I used a somewhat
comfortable mixture of top-down and bottom-up programming paradigms; the mixture was
dictated by the following factors:

1) Would the function be reusable for a different purpose than the calling function?
2) How complex would it be to transmit data between functions?

Oftentimes it would be simpler to just expand the function and not have to worry about data
transmission. Additionally, I am beginning the process of optimizing the code (and removing
my testing code) to make it more elegant and to speed the software up. Part of this optimization
process is to remove redundant and semi-redundant code; luckily, C makes it simple to
implement macros. The following is an explanation of a macro that exists in the software. The
purpose of this particular macro is to make sure that a particular pointer is non-null after
requesting new memory (E_REALLOC is a defined constant).

#define ASSERT(condition, error_code) \
        if ((condition) == 0) { \
                printf("Assertion: '%s' failed.\n", #condition); \
                exit(error_code); \
        }

#define REALLOC_CHECK(arg) \
        ASSERT((arg) != NULL, E_REALLOC)

The idea here is that I can manipulate the data types involved in a macro, passing a boolean and
a char* simultaneously; this allows me to write:

REALLOC_CHECK(array)

Instead of:

if (array == NULL)
        exit(E_REALLOC);

and still have a verbose error message while retaining code brevity. Further, regarding style: I
started out new to C in August 2013 (for the most part) and thus did not know much about style in
C, so I decided to use a mix of the style I learned directly out of K&R and the Linux kernel
style guide (Kroah-Hartman, 2002).
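The following is a minimal, self-contained usage sketch of the macro pair (the E_REALLOC value,
the array, and main are my illustrative assumptions, not the project's code):

#include <stdio.h>
#include <stdlib.h>

#define E_REALLOC 2 /* assumed value; the real constant lives in a project header */

#define ASSERT(condition, error_code) \
        if ((condition) == 0) { \
                printf("Assertion: '%s' failed.\n", #condition); \
                exit(error_code); \
        }

#define REALLOC_CHECK(arg) \
        ASSERT((arg) != NULL, E_REALLOC)

int main(void)
{
        int *array = malloc(10 * sizeof *array);
        REALLOC_CHECK(array);

        /* grow the array; on failure, REALLOC_CHECK prints the failed
         * condition and exits with E_REALLOC */
        array = realloc(array, 20 * sizeof *array);
        REALLOC_CHECK(array);

        free(array);
        return 0;
}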

Another paradigm used was that of an unsupervised machine learning algorithm (UMLA).
While the standard pragmatic flow of a UMLA is train => execute, this algorithm was able to get
UMLA results by defining the algorithm in terms of a series of variables that were only given
values by the algorithm running on real data. For example, the algorithm discards candidates
that don't account for THRESHOLD_CONFIRMATION percent of the found morpheme candidates (so if
there were 100,000 words and it found 415,000 morpheme candidates, then to be confirmed,
non-stems must account for THRESHOLD_CONFIRMATION percent of the data). Along these lines,
circumfixes are identified by the following rules (a sketch of how such thresholds might be
declared appears after the rules):

1) The prefix and the suffix occur with equal frequency
2) The words that contain the prefix and the words that contain the suffix have a percent
similarity of at least THRESHOLD_CIRCUMFIX
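As a sketch, these thresholds could be declared in constants.h like so (the numeric values are
illustrative assumptions, not the values the project actually uses):

/* constants.h (sketch; values are illustrative, not the project's) */
#define THRESHOLD_CONFIRMATION 5    /* percent of found candidates a non-stem must account for */
#define THRESHOLD_CIRCUMFIX 80      /* percent agreement between prefix- and suffix-word lists */
#define THRESHOLD_SIMILAR_NGRAMS 70 /* percent context similarity between two n-grams */
#define NGRAM_SIZE 9                /* size of each n-gram; always an odd number */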

With these data elements in mind, the algorithm generates and fills/modifies all the data
structure sizes and elements depending on whether the data matches the thresholds (so as to
prevent outliers from corrupting the data). Specifically, it will isolate morpheme candidates
based on two n-grams, generate morpheme lists, generate regexes (regular expressions) and
reconstruction data for each morpheme in the list, and then tag morphemes (and the groups of
words that are associated with each morpheme). This data is combined with other similar data
when necessary and reconstructed when modified.

The final CS goal of the project was to successfully manage a large C project. As I had never
managed a C project prior to this, I needed to learn quite a bit about it. The general
consensus on the internet was:

1) Use .h files for non-functional code (prototypes, structs, macros,
constants, externs)
2) Use .c files for functional code (i.e., variable and function definitions)
3) Use a Makefile for compilation

In regard to the Makefile, I ended up using make debug and make optimized: a debug
version that allows for profiling and better debugging, and an optimized version that makes use
of the GNU C Compiler (gcc)'s -O flag.


Computational Linguistic Goals
Another goal of this project was to be able to break down a Context-Sensitive Language (CSL)
into a Context-Free Grammar (CFG). A CSL is a description of how a language's syntax and
semantics vary depending on the context in which the words occur. For example, in the sentence
"My word, I thought out loud, how awful; I proceeded to laugh
experiencing overwhelming schadenfreude," the word "awful" would be interpreted
differently if, a few words later, the speaker hadn't said that s/he was experiencing
schadenfreude. This definition of awful is context-sensitive; languages which rely on this
principle (namely natural languages) are considered Context-Sensitive Languages. CFGs can be
derived from Context-Sensitive Languages (which, by definition, are derived from
Context-Sensitive Grammars) and can also be derived from Context-Free Languages. Another way of
describing this is { αAβ => αγβ }, γ ≠ ε (ε = the null string), where setting α = β = ε
transforms the definition of a CSL rule into the definition of a Context-Free Grammar rule.

Noam Chomsky (1956) defined Context-Sensitive Grammars as such:
"A rule of the form

ZXW => ZYW

indicates that X can be rewritten as Y only in the context Z-W" (Chomsky, 1956, p. 118).

Restated, CFGs are defined as a basic set of rules that are independent from one another; namely:

{ V => w } where V could be described as a non-terminal which yields a specific w (a
terminal, non-terminal, or null).

We are treating the problem as a generation of terminals for a Context-Free Grammar consisting
of the following:
{ <word> => <prefix>*<stem>+<suffix>* } where each token in angle brackets is a
sequence of characters representing a morpheme class, * means 0 or more, and + means 1 or more;
a prefix and suffix combination may be a circumfix placed equidistantly from all stems or from a
particular stem.

Further, this particular CFG can be represented as such:

Word/Morpheme CFG: {

<word> => <prefix>*<stem>+<suffix>* | <part_word><infix><part_word>

<prefix> => morpheme-list.chosen-morpheme.type == PREFIX |
morpheme.prefix if morpheme-list.chosen-morpheme.type == CIRCUMFIX

<stem> => morpheme-list.chosen-morpheme.type == STEM

<suffix> => morpheme-list.chosen-morpheme.type == SUFFIX |
morpheme.suffix if morpheme-list.chosen-morpheme.type == CIRCUMFIX

} where morpheme-list.chosen-morpheme is a particular morpheme from the
morpheme list and part_word is defined as part of a word which follows the affixation rule
generated by a particular infix.

Luckily for us, we can generate these rules through tokenization. The program needed to be able
to identify parallel environments, which characterize a form of lexical environment in which
each word occurs. From this, the programmed algorithm allows for the extraction of morphemes
based on the comparison of words in parallel environments. This parallelism is defined in
constants.h in terms of a percentage of parallelism (specifically, in the constant
THRESHOLD_SIMILAR_NGRAMS). Technically, the algorithm examines parallel environments by using a
constant NGRAM_SIZE to determine the size of n-grams (groupings of words). It then looks at the
left and right halves of two n-grams and generates an array of unique elements. It then uses the
following formula to determine the percent similarity between the two arrays:

int percent_similar = (double) 100.00 * ((double) (cl - ol) / (double) cl);

where cl is the length of the combined unique array and ol is the length of the original
combined array.
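A minimal, self-contained sketch of this computation follows. Note that, as the definitions are
written above, cl would be smaller than ol and the result would be negative, so this sketch
assumes the definitions are swapped: cl is the length of the combined array and ol the length
after removing duplicates. All names and details are my reconstruction, not the project's code:

#include <stdio.h>
#include <string.h>

#define MAX_WORDS 64 /* enough for this sketch */

/* Percent similarity of two word arrays, per the formula above under the
 * stated assumption: shared words shrink ol, driving the percentage up
 * (it tops out at 50% for identical arrays). */
static int percent_similar(const char **a, int alen, const char **b, int blen)
{
        const char *combined[MAX_WORDS];
        int cl = 0, ol = 0;

        for (int i = 0; i < alen; i++)
                combined[cl++] = a[i];
        for (int i = 0; i < blen; i++)
                combined[cl++] = b[i];

        /* count unique elements of the combined array */
        for (int i = 0; i < cl; i++) {
                int duplicate = 0;
                for (int j = 0; j < i; j++)
                        if (strcmp(combined[i], combined[j]) == 0)
                                duplicate = 1;
                if (!duplicate)
                        ol++;
        }

        return (int) ((double) 100.00 * ((double) (cl - ol) / (double) cl));
}

int main(void)
{
        const char *x[] = { "the", "planet", "was", "red" };
        const char *y[] = { "the", "planet", "is", "red" };

        printf("%d%%\n", percent_similar(x, 4, y, 4)); /* prints 37% */
        return 0;
}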


How Much to Implement
The biggest factor in what was implemented versus what was intended to be implemented was time.
Given only 3 months, and needing to use 1 month to get through Chapter 1 of K&R, I ran into
time limitations. While the program successfully analyzes and extracts morphemes, I had intended
to also write an unsupervised language-blind morphosyntactic tagger and a semantic extraction
mechanism. Things I would like to change in this algorithm, given more time and future funding,
are covered in Future Ambitions.


Research Question
How does one non-statistically extract morphemes from prepared corpora to develop a morpheme
list, using an unsupervised machine learning algorithm in the C programming language?


Work Accomplished
During this project, I successfully implemented a morpheme extraction algorithm. It assumes
nothing about a supplied corpus. It currently has memory leaks through which about 0.10% of
memory is leaked. The code is about 1800 lines and consists of 21 code files, a
Makefile, a LICENSE file, and a README file.

To run the program
You can download the code at https://github.com/zachbornheimer/morpheme-extraction. Once
the code has been downloaded, you need the following tools:

1) make - https://www.gnu.org/software/make/
2) gcc - https://www.gnu.org/software/gcc/

In the directory where the software was downloaded, run make optimized for the optimized
code. Further information on how to run the program is available in the README file.

Algorithm
The algorithm implemented uses the following procedure:
1) Identify and activate command-line options
2) Extract the word-delimiter
3) Develop n-gram relationships
4) If --process-sequentially, --serial, --sequential, or --process
has been passed, find morphemes
5) Repeat from step 3 until there are no more files
6) If --process is passed, or no argument involving processing (see step 4) is passed, find
morphemes (a sketch of this flag check appears after this list)
7) Write the type of each morpheme (PREFIX, SUFFIX, STEM, INFIX, or CIRCUMFIX)
and the morpheme to the file specified with a command-line parameter, or to the default file
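As a sketch of that flag check only (a reconstruction from steps 4 and 6; the real argument
handling lives in the repository):

#include <stdio.h>
#include <string.h>

/* Returns 1 if any of the processing flags from step 4 was passed. */
static int processing_flag_passed(int argc, char **argv)
{
        const char *flags[] = { "--process-sequentially", "--serial",
                                "--sequential", "--process" };

        for (int i = 1; i < argc; i++)
                for (size_t f = 0; f < sizeof flags / sizeof flags[0]; f++)
                        if (strcmp(argv[i], flags[f]) == 0)
                                return 1;
        return 0;
}

int main(int argc, char **argv)
{
        printf("processing flag passed: %s\n",
               processing_flag_passed(argc, argv) ? "yes" : "no");
        return 0;
}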

Extract the Word-Delimiter
We define a word-delimiter as a character or string of characters which neither has nor
contributes semantic value, other than to delineate the end of semantic value in a grapheme
sequence.

For this stage, the algorithm takes an input file and develops a unique array of all characters that
occur in the file. For each character in the unique array, it splits the processed file into a
sequence of non-null strings and tallies the number of non-null elements that exist. It looks at
the tallies generated for each possible word-delimiter candidate. If there is one character that has
the highest generated split word count, it is returned as the word-delimiter. If there is more than
one, it tests permutations according to the following algorithm (represented in pseudocode):

/* PERMUTATION ALGORITHM (pseudocode): */
while (string != testing_string) {
        for (0 to len(testing_string) as position) {
                move-to-the-right(testing_string.character-at[position]);
                increment(position);
        }
}

It tests each permutation against the file being tested to find the number of non-null strings,
and it compares the result against all the permutations of all other files by storing the
word-delimiter that results in the largest frequency, along with that frequency. If it finds
that another word-delimiter candidate has an equally large count (that isn't 0), it skips that
file: given that there are potentially two differing word-delimiters, either the file doesn't
conform to the given word-delimiter definition or something else went wrong; either way, it's
probably best to skip the file.
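A minimal sketch of the per-candidate tally (single-byte candidates only; the tie-breaking
permutation pass is omitted, and all names here are my assumptions rather than the project's
code):

#include <stdio.h>

/* Count the non-null segments produced by splitting text on delim; the
 * candidate yielding the highest count is taken as the word-delimiter. */
static int segments_if_split_on(const char *text, char delim)
{
        int segments = 0, in_segment = 0;

        for (; *text != '\0'; text++) {
                if (*text == delim) {
                        in_segment = 0;
                } else if (!in_segment) {
                        in_segment = 1;
                        segments++;
                }
        }
        return segments;
}

int main(void)
{
        const char *text = "the war of the worlds";

        printf("' ': %d\n", segments_if_split_on(text, ' ')); /* 5 segments */
        printf("'e': %d\n", segments_if_split_on(text, 'e')); /* 3 segments */
        return 0;
}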

Develop N-Gram Relationships
For this mechanism, the algorithm splits the text into words and begins generating the n-gram
data structure based on a constant-defined NGRAM_SIZE (always an odd number).
Thanks to C's truncating integer division, NGRAM_SIZE/2 always rounds down and is
equivalent to (NGRAM_SIZE-1)/2 for odd NGRAM_SIZE; whenever I refer to NGRAM_SIZE/2, I mean
the value C would compute, (NGRAM_SIZE-1)/2. The algorithm sets ngram.word to be
the target word, and sets NGRAM_SIZE/2 words for ngram.before and NGRAM_SIZE/2
words for ngram.after. ngram.before and ngram.after represent the words that
occur before and after the target word (if they exist). If NGRAM_SIZE = 9 and the target is the
third word in the corpus, ngram.before[0] and ngram.before[1] will be empty, but
ngram.before[2] and ngram.before[3] will contain data. If the target word occurs
more than once in the corpus, instead of creating a new n-gram, the algorithm adds to the
existing n-gram (so ngram.before may no longer contain a maximum of NGRAM_SIZE/2 words, but
rather N*(NGRAM_SIZE/2) words, where N is the number of times that ngram.word occurs in
the corpus).
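A sketch of the record implied by this description (the text names only .word, .before, and
.after, and describes the similarity links as both an array and a linked list; this sketch uses
a plain array of addresses, and every other field name is my assumption):

#include <stddef.h>

#define NGRAM_SIZE 9 /* always odd; NGRAM_SIZE/2 truncates to 4 in C */

struct ngram {
        char *word;             /* the target word */
        char **before;          /* up to N * (NGRAM_SIZE/2) preceding words */
        char **after;           /* up to N * (NGRAM_SIZE/2) following words */
        size_t occurrences;     /* N: times the target word appears */
        struct ngram **similar; /* addresses of similar subsequent n-grams */
        size_t similar_count;
};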

Lastly, the algorithm goes through each n-gram and compares its context to every other
subsequent n-gram whose target word doesn't occur within NGRAM_SIZE distance. This is
accomplished by combining the .before and .after arrays into one array and comparing that
array to another n-gram's merged array. If the elements within the two arrays have a
percent similarity (as defined in Computational Linguistic Goals) greater than or equal to
THRESHOLD_SIMILAR_NGRAMS, the original n-gram stores the address of the similar n-gram
in a singly linked list. Instead of creating a doubly linked or circular linked list, we use a
singly linked list because it would have been redundant to store the relationship twice (ngram-x
is related to ngram-y AND ngram-y is related to ngram-x); we don't care about the order of a
relationship, but rather that a relationship exists. As implemented, each n-gram looks for
similarity starting with ngram[n+1]. For example, if ngram[0] and ngram[5] are similar, the
connection will be stored in ngram[0]. However, when it comes time for ngram[5] to store the
addresses of similar n-grams, it will start with ngram[6] as opposed to ngram[0]. This decision
originates from the connection between ngram[0] and ngram[5] already being stored and not
needing to be duplicated; the resulting data wouldn't change if ngram[5] stored the address of
ngram[0], and in fact it might become an infinite regression of memory addresses during
processing, which would require additional precautions. Thus, for simplicity's sake, the
algorithm only looks at subsequent n-grams.
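The following toy program (invented similarity values and an illustrative threshold, not
project data) demonstrates this one-directional comparison pass:

#include <stdio.h>

#define N_GRAMS 4
#define THRESHOLD_SIMILAR_NGRAMS 70 /* illustrative value */

int main(void)
{
        /* toy percent-similarity values between n-gram contexts */
        static const int similarity[N_GRAMS][N_GRAMS] = {
                { 100,  80,  10,  75 },
                {  80, 100,  20,   5 },
                {  10,  20, 100,  90 },
                {  75,   5,  90, 100 },
        };

        /* each ngram[i] only scans ngram[j] for j > i, so every
         * relationship is discovered and stored exactly once */
        for (int i = 0; i < N_GRAMS; i++)
                for (int j = i + 1; j < N_GRAMS; j++)
                        if (similarity[i][j] >= THRESHOLD_SIMILAR_NGRAMS)
                                printf("ngram[%d] stores ngram[%d]\n", i, j);
        return 0;
}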

Find Morphemes
The idea of finding morphemes is fairly simple and revolves around a code sequence called
find_longest_match, which takes string0 and string1 and tries to find the longest
common contiguous character string starting from position 0 of string0 and string1. The program
executes this on two target words with their characters non-reversed and reversed (to find
immediate prefixes and suffixes, respectively) and stores this data. It then runs on the
internals of the words by removing the first character of string1 and looking for a common
contiguous string with a length greater than or equal to 2 characters. It stores this data after
it makes sure that the found morpheme isn't a proper subset of a prior morpheme (like {
morpheme0: subset, morpheme1: ubset }). It proceeds until string1 is exhausted. If
string1 is exhausted and string0 is not, it will remove the first character of string0, reset
string1, and repeat removing characters from string1 until both strings are exhausted.
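A minimal reconstruction of the core matcher (a sketch of the idea only; the real function
presumably tracks positions and stores matches differently):

#include <stdio.h>
#include <stddef.h>

/* Length of the longest common contiguous run starting at position 0 of
 * both strings. Running it on reversed copies of two words finds shared
 * suffixes; shifting the start of string1 finds internal matches. */
static size_t find_longest_match(const char *string0, const char *string1)
{
        size_t n = 0;

        while (string0[n] != '\0' && string1[n] != '\0' &&
               string0[n] == string1[n])
                n++;
        return n;
}

int main(void)
{
        /* "planet" and "plans" share the 4-character run "plan" */
        printf("%zu\n", find_longest_match("planet", "plans"));
        return 0;
}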

The program runs this algorithm for each combination of target words such that it follows the
permutation algorithm as defined in the description of the word-delimiter extraction.

It removes duplicate morphemes by fusion. It fuses the duplicate morphemes by generating
regular expressions based on the words in which the morpheme originated and combining them
character by character, creating classes within two sets of parentheses (representing arrays for
the pre-morpheme and post-morpheme context). From there it reconstructs the full regex from its
reconstruction data (which is the raw regex stored by character position), and the result is
analyzed for the type of morpheme.

Morphemes are tagged as stems if they occur standing alone at any point, or if they are marked
as either a prefix or a suffix and then occur as the opposite (a suffix or prefix,
respectively). Prefixes occur at the beginning of words and suffixes at the end; infixes are
tagged if they occur naturally and account for THRESHOLD_CONFIRMATION percent of the morpheme
candidates. Prefixes and suffixes are automatically tagged as such if they meet the threshold
and occur in the correct position. Circumfixes are tagged by looking at the current morpheme's
type: if it is a suffix, the algorithm will look through all the morphemes for a prefix (if it
is a prefix, for a suffix) that appears with the same frequency and whose list of associated
words has a percent agreement greater than or equal to THRESHOLD_CIRCUMFIX, while both morphemes
also meet the confirmation threshold. Anything marked UNDEF that meets the confirmation
threshold is marked as an infix. All morphemes are tagged UNDEF until this tagging process.

Data Observation
For morpheme tagging, the priority list is:

1) STEM
2) INFIX
3) CIRCUMFIX
4) PREFIX & SUFFIX
5) UNDEF
where lower value has higher priority.

Once something is tagged as STEM, it can't be demoted from that level. That being said, any
other type can be retagged as STEM if the morpheme appears unbound (standing alone). This is
done as a measure of accuracy, and it can be seen in the data supplied (see Data Sample). A
sketch of this retagging rule follows.
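As a sketch of the priority rule (the enum ordering mirrors the list above; PREFIX and SUFFIX
share a priority level in the text, which a simple enum can only approximate, and all names
beyond the tags the paper mentions are my assumptions):

#include <stdio.h>

/* lower value = higher priority, per the list above */
enum morpheme_type { STEM, INFIX, CIRCUMFIX, PREFIX, SUFFIX, UNDEF };

/* Retag only when the proposed type outranks the current one; since STEM
 * has the highest priority, a STEM can never be demoted. A reconstruction,
 * not the project's code. */
static void retag(enum morpheme_type *current, enum morpheme_type proposed)
{
        if (proposed < *current)
                *current = proposed;
}

int main(void)
{
        enum morpheme_type t = SUFFIX;

        retag(&t, STEM);   /* suffix seen standing alone: promote to STEM */
        retag(&t, INFIX);  /* no effect: STEM can never be demoted */
        printf("%d\n", t); /* prints 0 (STEM) */
        return 0;
}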

Data Sample
The following is a data sample gleaned from running on the test corpus supplied in the
git repository under test-corpus; specifically, it is the first chapter of H. G. Wells's War of
the Worlds.

The results took 5.118s of real time (as measured by the time program included with Mac OS X
10.9) and used 274.2 MB of RAM, running the version created with make optimized serially:
./nlp --serial
Additionally, for the sake of ease, I ran sort on the output to order it alphabetically for display:

./test-corpus/War_of_the_Worlds.txt
===================
INFIX: --
INFIX: ace
INFIX: ack
INFIX: ad
INFIX: ain
INFIX: al
INFIX: am
INFIX: an
INFIX: ance
INFIX: anet
INFIX: ar
INFIX: arer
INFIX: ars
INFIX: at
INFIX: ber
INFIX: bit
INFIX: ble
INFIX: ca
INFIX: ce
INFIX: cessar
INFIX: credibl
INFIX: ct
INFIX: de
INFIX: der
INFIX: dnight
INFIX: ea
INFIX: ead
INFIX: ec
INFIX: ects
INFIX: ed
INFIX: edibl
INFIX: een
INFIX: ef
INFIX: el
INFIX: eld
INFIX: eling
INFIX: em
INFIX: ember
INFIX: en
INFIX: ence
INFIX: ent
INFIX: er
INFIX: ern
INFIX: es
INFIX: ess
INFIX: et
INFIX: ey
INFIX: fe
INFIX: gard
INFIX: gh
INFIX: ght
INFIX: ha
INFIX: habit
INFIX: haw
INFIX: ht
INFIX: ib
INFIX: ibl
INFIX: igen
INFIX: igence
INFIX: ight
INFIX: il
INFIX: ile
INFIX: iles
INFIX: ill
INFIX: ilvy
INFIX: im
INFIX: ing
INFIX: int
INFIX: inted
INFIX: ion
INFIX: ir
INFIX: ist
INFIX: istence
INFIX: ite
INFIX: la
INFIX: lai
INFIX: ld
INFIX: le
INFIX: les
INFIX: lescop
INFIX: lescope
INFIX: lie
INFIX: ll
INFIX: llects
INFIX: lligen
INFIX: lligence
INFIX: llion
INFIX: lu
INFIX: me
INFIX: member
INFIX: mens
INFIX: mote
INFIX: mpla
INFIX: nc
INFIX: nce
INFIX: ncern
INFIX: nces
INFIX: nd
INFIX: ne
INFIX: ness
INFIX: nger
INFIX: ns
INFIX: nt
INFIX: ntur
INFIX: ntury
INFIX: ob
INFIX: ol
INFIX: om
INFIX: on
INFIX: onom
INFIX: op
INFIX: ope
INFIX: ople
INFIX: ou
INFIX: oud
INFIX: ough
INFIX: out
INFIX: ow
INFIX: owe
INFIX: pe
INFIX: pear
INFIX: per
INFIX: pers
INFIX: pl
INFIX: plain
INFIX: pp
INFIX: pul
INFIX: rd
INFIX: re
INFIX: rea
INFIX: retch
INFIX: rf
INFIX: rface
INFIX: rge
INFIX: ri
INFIX: rkness
INFIX: rld
INFIX: rm
INFIX: rn
INFIX: ro
INFIX: rs
INFIX: rshaw
INFIX: rt
INFIX: rth
INFIX: ru
INFIX: rutin
INFIX: rv
INFIX: scop
INFIX: scope
INFIX: se
INFIX: serv
INFIX: sh
INFIX: si
INFIX: ss
INFIX: ssar
INFIX: st
INFIX: stan
INFIX: stance
INFIX: tch
INFIX: te
INFIX: ted
INFIX: tel
INFIX: tell
INFIX: tellects
INFIX: telligen
INFIX: telligence
INFIX: tence
INFIX: ter
INFIX: tershaw
INFIX: th
INFIX: ti
INFIX: tin
INFIX: tion
INFIX: tronom
INFIX: ts
INFIX: ul
INFIX: und
INFIX: ur
INFIX: ury
INFIX: us
INFIX: use
INFIX: vy
INFIX: wer
PREFIX: "
PREFIX: 1
PREFIX: A
PREFIX: C
PREFIX: D
PREFIX: E
PREFIX: F
PREFIX: H
PREFIX: M
PREFIX: Mar
PREFIX: N
PREFIX: O
PREFIX: Pe
PREFIX: S
PREFIX: T
PREFIX: Th
PREFIX: Thi
PREFIX: _
PREFIX: a
PREFIX: ab
PREFIX: ac
PREFIX: af
PREFIX: ap
PREFIX: app
PREFIX: appear
PREFIX: astronom
PREFIX: b
PREFIX: ba
PREFIX: beg
PREFIX: bel
PREFIX: belie
PREFIX: bi
PREFIX: bl
PREFIX: bla
PREFIX: br
PREFIX: bri
PREFIX: bu
PREFIX: c
PREFIX: cal
PREFIX: centur
PREFIX: ch
PREFIX: cl
PREFIX: clo
PREFIX: cloud
PREFIX: co
PREFIX: compla
PREFIX: con
PREFIX: concern
PREFIX: cr
PREFIX: d
PREFIX: da
PREFIX: danger
PREFIX: di
PREFIX: dis
PREFIX: dist
PREFIX: distan
PREFIX: dr
PREFIX: du
PREFIX: e
PREFIX: ear
PREFIX: emp
PREFIX: ev
PREFIX: eve
PREFIX: ex
PREFIX: exc
PREFIX: exce
PREFIX: exp
PREFIX: expl
PREFIX: explain
PREFIX: eyes
PREFIX: f
PREFIX: fa
PREFIX: fai
PREFIX: fee
PREFIX: fi
PREFIX: field
PREFIX: fir
PREFIX: fl
PREFIX: fla
PREFIX: flam
PREFIX: fo
PREFIX: fr
PREFIX: g
PREFIX: ga
PREFIX: gr
PREFIX: gre
PREFIX: gu
PREFIX: gun
PREFIX: h
PREFIX: happ
PREFIX: hav
PREFIX: hea
PREFIX: her
PREFIX: hi
PREFIX: ho
PREFIX: hou
PREFIX: house
PREFIX: hu
PREFIX: i
PREFIX: illu
PREFIX: imm
PREFIX: imme
PREFIX: immens
PREFIX: inc
PREFIX: incredibl
PREFIX: inte
PREFIX: intell
PREFIX: intelligen
PREFIX: j
PREFIX: ju
PREFIX: k
PREFIX: l
PREFIX: large
PREFIX: li
PREFIX: lo
PREFIX: m
PREFIX: ma
PREFIX: mat
PREFIX: men
PREFIX: mi
PREFIX: mid
PREFIX: mil
PREFIX: min
PREFIX: mo
PREFIX: mor
PREFIX: mu
PREFIX: mus
PREFIX: n
PREFIX: necessar
PREFIX: ni
PREFIX: o
PREFIX: obs
PREFIX: observ
PREFIX: p
PREFIX: pa
PREFIX: pla
PREFIX: plan
PREFIX: po
PREFIX: point
PREFIX: popul
PREFIX: power
PREFIX: pr
PREFIX: pro
PREFIX: prob
PREFIX: r
PREFIX: ra
PREFIX: read
PREFIX: real
PREFIX: rec
PREFIX: rece
PREFIX: rem
PREFIX: remote
PREFIX: rush
PREFIX: s
PREFIX: sa
PREFIX: sc
PREFIX: scrutin
PREFIX: sec
PREFIX: see
PREFIX: ser
PREFIX: sho
PREFIX: sl
PREFIX: sli
PREFIX: slight
PREFIX: sm
PREFIX: so
PREFIX: sou
PREFIX: sp
PREFIX: spec
PREFIX: spi
PREFIX: sta
PREFIX: stead
PREFIX: str
PREFIX: stre
PREFIX: strea
PREFIX: stretch
PREFIX: stru
PREFIX: su
PREFIX: sup
PREFIX: sur
PREFIX: sw
PREFIX: swi
PREFIX: swim
PREFIX: t
PREFIX: ta
PREFIX: telescop
PREFIX: tha
PREFIX: thi
PREFIX: thir
PREFIX: tho
PREFIX: thou
PREFIX: thr
PREFIX: thre
PREFIX: thro
PREFIX: tr
PREFIX: tra
PREFIX: tw
PREFIX: twe
PREFIX: twel
PREFIX: twent
PREFIX: u
PREFIX: un
PREFIX: v
PREFIX: va
PREFIX: ve
PREFIX: vi
PREFIX: vo
PREFIX: vol
PREFIX: w
PREFIX: wa
PREFIX: water
PREFIX: we
PREFIX: wh
PREFIX: whi
PREFIX: wi
PREFIX: win
PREFIX: wis
PREFIX: wo
PREFIX: ye
STEM: For
STEM: I
STEM: It
STEM: Mars
STEM: Ogilvy
STEM: Ottershaw
STEM: The
STEM: about
STEM: after
STEM: all
STEM: are
STEM: as
STEM: be
STEM: black
STEM: bright
STEM: by
STEM: century
STEM: cool
STEM: darkness
STEM: day
STEM: distance
STEM: do
STEM: dust
STEM: earth
STEM: ever
STEM: existence
STEM: eye
STEM: faint
STEM: feeling
STEM: for
STEM: fro
STEM: gas
STEM: green
STEM: grey
STEM: he
STEM: hour
STEM: idea
STEM: in
STEM: inhabit
STEM: intellects
STEM: intelligence
STEM: is
STEM: it
STEM: last
STEM: life
STEM: light
STEM: lit
STEM: man
STEM: mean
STEM: midnight
STEM: miles
STEM: million
STEM: my
STEM: nearer
STEM: new
STEM: night
STEM: no
STEM: of
STEM: one
STEM: or
STEM: our
STEM: ours
STEM: paper
STEM: papers
STEM: part
STEM: people
STEM: planet
STEM: pointed
STEM: red
STEM: regard
STEM: remember
STEM: round
STEM: same
STEM: seem
STEM: since
STEM: slow
STEM: small
STEM: space
STEM: star
STEM: still
STEM: sun
STEM: surface
STEM: telescope
STEM: that
STEM: the
STEM: them
STEM: to
STEM: under
STEM: up
STEM: war
STEM: warm
STEM: was
STEM: world
STEM: years
SUFFIX: ated
SUFFIX: e,
SUFFIX: ed,
SUFFIX: er,
SUFFIX: ly
SUFFIX: ous
SUFFIX: ves


What's most interesting about this data is that it doesn't seem all that far off in some
sections yet is wildly off in others. The suffixes seem to be about 60% correct, insofar as the
algorithm discerned very familiar suffixes. The stems are 100% accurate. The prefixes and
infixes are wildly off. This is in part due to the absence of meaning. Once software tries to
assign meaning to each morpheme, it will facilitate the eradication of false positives.

Future Ambitions
For the future, it would make sense to rewrite parts of this algorithm to make use of a GPU
through CUDA or OpenCL. This would allow most of the operations involved in determining n-gram
parallelism to happen simultaneously as opposed to serially. The savings would be substantial:
in a 100,000-word corpus there are 100,000 n-grams of size NGRAM_SIZE, and since each n-gram is
compared against every subsequent one, roughly 100,000 × 99,999 / 2, or about 5 × 10^9,
comparisons are processed in total. If there were no limits on GPU cores, each of those
comparisons could run concurrently instead of sequentially, so the pass could theoretically take
approximately the time of processing a single comparison.

Another improvement would be to make use of a job scheduler, or to create one, which would allow
this program to run on multiple computers simultaneously. The ideal host for this would be
Amazon's EC2 service, as they have the ability to do GPU processing in a High Performance
Cluster of machines. Running a final program on this type of machinery would yield terrific
results.

Finally, the most important aspect of my future ambitions would be to create a
morphosyntactic tagger and a gloss generator (semantics extractor) using a corpus and nothing
else beyond semantic seed data (semantics require seed data because language and meaning were
not associated in a bubble, but rather are the result of associations with the real world).
These two tools are the next logical steps after this program, as their code would function in a
manner parallel to the morpheme extraction algorithm.
References

Chomsky, N. (1956). Three models for the description of language. IRE Transactions on
Information Theory, 2(3), 113-124.
Kernighan, B. W., & Ritchie, D. M. (1998). The C programming language (2nd ed.).
Englewood Cliffs, NJ: Prentice Hall.
Kroah-Hartman, G. (2002, June 26). Kernel CodingStyle. The Linux Kernel Archives. Retrieved
December 12, 2013, from https://www.kernel.org/doc/Documentation/CodingStyle
