A document describing research on implementing an algorithm for morphemic discovery in the C programming language. Using the algorithm designed by Zach Bornheimer, moderate results were achieved; the accuracy of these results is explained in the paper.
This paper deals with machine learning, artificial intelligence, corpus and computational linguistics, and the C programming language.
Submitted as part of the Supervised Research course for the Honors College at USF.
Bornheimer (2014)
zbornheimer@gmail.com | University of South Florida
Honors College - Supervised Research
Non-statistical Language-Blind Morpheme (Candidate)* Extraction: An Unsupervised Machine Learning Approach
*One cannot know whether a particular grapheme sequence represents a morpheme until meaning can be assigned to it. Throughout this paper, I use the term morpheme to mean morpheme candidate.
Submitted December 19, 2013; Revised March 5, 2014
Zach Bornheimer

Introduction/Abstract
During Fall 2013, I was given the opportunity to work on a project that would yield language-blind, non-statistical morpheme extraction. Morpheme lists are key to many research projects in linguistics; however, they have to be tailor-made for the language/experiment. I set out to determine a way to create a morpheme list without knowing the language. While I was somewhat successful in this approach, there were some flaws, namely false positives. The data delivered is promising; however, there are two hurdles: serial computing and meaning. Meaning assigned to morphemes eliminates false positives, yet we are working without a gloss, so we have quite a few false positives; with more data, however, this may not be an issue. Additionally, because of the computing power required to run the code, it would be ideal to rewrite part of it to make use of the parallel computing power of a GPU (which may yield speed enhancements of many orders of magnitude). Overall, this was a good first step in a project that has much grander requirements.
Design Decisions
Paradigms
The approach for this project was non-statistical from its origins, as the majority of work being done in natural language processing is done via statistics. I didn't really understand the statistical approach; I came into the problem with the idea that humans are pattern-based creatures, and that if languages are developed through patterns and rules, the deconstruction of a language can be done with patterns and rules. In terms of implementation, I had completed a bit of the work in Kernighan & Ritchie's book The C Programming Language (1988), or K&R as the industry calls it, by August 2013, but I had reached the part of the book where it said to undertake a large task:
"It's possible to write useful programs of considerable size, and it would probably be a good idea if you paused long enough to do so" (Kernighan & Ritchie, 1988, last paragraph of Ch. 1 before the exercises).
While this task was quite substantial, the only subjects I needed to learn for it that were not presented in Chapter 1 of K&R were memory management (pointers and references) and structs.
Language
I decided to use an unsupervised machine learning algorithm, which more closely mimics the human acquisition process. For this research, I undertook the program's development in the C programming language for the following reasons:
1) I could control how memory was allocated and when it was freed
2) I could manipulate the memory itself
3) I could store memory addresses instead of duplicating content
When testing a simpler, more aggressive version of the system design in Perl6, I found that memory usage was an insurmountable difficulty. My program in Perl6, written about a year earlier (Dec 2012), assumed the corpora supplied used a Latin alphabet; this program makes no such assumption. Writing the program in C was something that needed to be done from an artificial intelligence perspective, as C would allow me to control the speed and intensity of the program; the program written in Perl6 was unable, with sufficient time or memory, to handle reading Moby Dick, The Adventures of Huckleberry Finn, and/or War of the Worlds. The goal of the program was to develop a morpheme list from these texts using the same basic algorithm while controlling the memory and the assumptions of the design; this could only be accomplished with C.
Computer Science Paradigms
As memory management was a goal, I decided early on to use the tool valgrind to help me eliminate memory leaks and to optimize performance. Additionally, I took a somewhat comfortable mixture of top-down and bottom-up programming paradigms; the mixture was dictated by the following factors:
1) Would the function be reusable for a different purpose than the calling function?
2) How complex would it be to transmit data between functions?
Oftentimes, it would be simpler to just expand the function and not have to worry about data transmission. Additionally, I am beginning the process of optimizing the code (and removing my testing code) to make it more elegant and to speed the software up. Part of this optimization process is to remove redundant and semi-redundant code; luckily, C makes it simple to implement macros. The following is an explanation of a macro that exists in the software. The purpose of this particular macro is to make sure that a pointer is non-null after requesting new memory (E_REALLOC is a defined constant).
The idea here is that I can manipulate the data types involved in a macro so that I can pass a boolean and a char* simultaneously; this allows me to write:
REALLOC_CHECK(array)
Instead of:
if (array == NULL) exit(E_REALLOC);
and have a verbose error message while retaining code brevity. Further, going along with the paradigm of style: I started out new to C in August 2013 (for the most part) and thus did not know much about style in C, so I decided to use a mix of the style I learned directly out of K&R and the Linux kernel style guide (Kroah-Hartman, 2002).
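The macro itself is not reproduced above, so the following is a minimal sketch of what a REALLOC_CHECK-style macro could look like; the exit-code value and message wording are my assumptions, not the project's actual definitions.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define E_REALLOC 8  /* hypothetical value; the real constant is defined in the project */

/* Sketch of a REALLOC_CHECK-style macro: if the pointer is NULL after a
 * (re)allocation, print a verbose message naming the variable (via the
 * stringizing # operator) and exit with E_REALLOC. */
#define REALLOC_CHECK(ptr)                                              \
    do {                                                                \
        if ((ptr) == NULL) {                                            \
            fprintf(stderr, "allocation failed for " #ptr " (%s:%d)\n", \
                    __FILE__, __LINE__);                                \
            exit(E_REALLOC);                                            \
        }                                                               \
    } while (0)

/* Example use: grow a buffer and check the result in one line. */
static char *grow(char *buf, size_t n)
{
    buf = realloc(buf, n);
    REALLOC_CHECK(buf);
    return buf;
}
```

The do/while(0) wrapper is the standard idiom that lets the macro behave like a single statement after an if without braces.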
Another paradigm used was the unsupervised machine learning algorithm (UMLA) paradigm. While the standard pragmatic flow of a UMLA is train => execute, this algorithm was able to get UMLA results by defining the algorithm in terms of a series of variables that are only defined by the algorithm running on real data. For example, the algorithm discards data that doesn't account for THRESHOLD_CONFIRMATION percent of the found morpheme candidates (so if there are 100,000 words and it found 415,000 morpheme candidates, a non-stem must account for THRESHOLD_CONFIRMATION percent of that data to be confirmed). Along these lines, circumfixes are identified by the following rules:
1) The prefix and suffix occur in equal frequency
2) The words that contain the prefix and the words that contain the suffix have a percent similarity of THRESHOLD_CIRCUMFIX
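As a concrete illustration of the confirmation threshold described above (my reading of the rule; the constant's value here is invented, and the actual test in the code may differ):

```c
#include <assert.h>

#define THRESHOLD_CONFIRMATION 5  /* hypothetical percentage; the real value lives in constants.h */

/* A candidate is confirmed only if its occurrences reach
 * THRESHOLD_CONFIRMATION percent of all morpheme candidates found.
 * Cross-multiplied integer arithmetic avoids floating-point comparison. */
static int is_confirmed(long occurrences, long total_candidates)
{
    return occurrences * 100 >= (long)THRESHOLD_CONFIRMATION * total_candidates;
}
```

With the paper's example of 415,000 candidates and an assumed 5% threshold, a non-stem would need 20,750 occurrences to survive.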
With these data elements in mind, the algorithm generates and fills/modifies all the data structure sizes and elements depending on the data's matching of thresholds (to prevent outliers from corrupting the data). Specifically, it isolates morpheme candidates based on two n-grams, generates morpheme lists, generates regexes (regular expressions) and reconstruction data for each morpheme in the list, and then tags morphemes (and the groups of words associated with each morpheme). This data is combined with other similar data when necessary and reconstructed when modified.
The final CS goal of the project was to successfully manage a large C project. As I had never managed a C project prior to this, I needed to learn quite a bit about it. The general consensus on the internet was:
1) Use .h files for non-functional code (prototypes, structs, macros, constants, externs)
2) Use .c files for functional code (i.e. variable and function definitions)
3) Use a Makefile for compilation
In regards to the Makefile, I ended up using make debug and make optimized: a debug version that allows for profiling and better debugging, and an optimized version that makes use of the GNU C Compiler (gcc)'s -O flag.
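The repository's actual Makefile is not reproduced here, but based on the description above it might look roughly like the following sketch (the target names follow the text; the exact flag set is my assumption). Note that recipe lines in a Makefile must be indented with a tab character:

```make
# Hypothetical sketch; the real Makefile in the repository may differ.
CC  = gcc
SRC = $(wildcard *.c)

# Debug build: symbols (-g) and gprof profiling hooks (-pg).
debug:
	$(CC) -g -pg -Wall -o nlp $(SRC)

# Optimized build, per the text, uses gcc's -O flag.
optimized:
	$(CC) -O -o nlp $(SRC)
```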
Computational Linguistic Goals
Another goal of this project was to be able to break down a Context-Sensitive Language (CSL) into a Context-Free Grammar (CFG). A CSL is a description of how a language's syntax and semantics vary depending on the context in which the words occur. For example, in the sentence "My word, I thought out loud, how awful; I proceeded to laugh, experiencing overwhelming schadenfreude," "awful" would be interpreted differently if, a few words later, the speaker didn't say that s/he was experiencing schadenfreude. This definition of awful is context-sensitive; languages which rely on this principle (namely natural languages) are considered Context-Sensitive Languages. CFGs can be derived from Context-Sensitive Languages (which, by definition, are derived from Context-Sensitive Grammars) and can also be derived from Context-Free Languages. Another way of describing this: a CSG rule has the form { αAβ => αγβ }, γ ≠ ε (ε = null); setting α = β = ε transforms the definition of a CSL rule into the definition of a Context-Free Grammar rule.
Noam Chomsky (1956) defined context-sensitive grammars as such: "A rule of the form [ZXW -> ZYW] indicates that X can be rewritten as Y only in the context Z-W" (Chomsky, 1956, p. 118).
Restated, CFGs are defined as a basic set of rules that are independent from one another, namely:
{ V => w } where V could be described as a non-terminal which yields a specific w (a terminal, non-terminal, or null).
We are treating the problem as a generation of terminals for a Context-Free Grammar consisting of the following: { <word> => <prefix>*<stem>+<suffix>* }, where each token in angle brackets is a sequence of characters representing a morpheme class, * = 0 or more, and + = 1 or more; a prefix and suffix combination may be a circumfix placed equidistantly from all stems or from a particular stem.
Further, this particular CFG can be represented as such:
} where morpheme-list.chosen-morpheme is a particular morpheme from the morpheme list and part_word is defined as part of a word which follows the affixation rule generated by a particular infix.
Luckily for us, we can generate these rules through tokenization. The program needed to be able to identify parallel environments, which identify a form of lexical environment in which each word occurs. From this, the programmed algorithm allows for the extraction of morphemes based on the comparison of words in parallel environments. This parallelism is defined in constants.h in terms of percentage of parallelism (specifically, in the constant THRESHOLD_SIMILAR_NGRAMS). Technically, the algorithm examines parallel environments by using a constant NGRAM_SIZE to determine the size of n-grams (groupings of words). It then looks at the left and right halves of two n-grams and generates an array of unique elements. It then uses the following formula to determine the percent similarity between the two arrays:
int percent_similar = (double) 100.00 * ((double) (cl - ol) / (double) cl);
where cl is the length of the combined unique array and ol is the length of the original combined array.
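Read literally, the definitions of cl and ol above would make the value negative (a deduplicated array can never be longer than the raw concatenation), so the sketch below assumes the labels are swapped: cl is the length of the two merged context arrays concatenated, and ol is the length after removing duplicates. Under that reading, identical arrays score 50 and disjoint arrays score 0. This is my interpretation, not the project's verified code.

```c
#include <assert.h>
#include <string.h>

/* The percent-similarity expression from the text, as a function. */
static int percent_similar(int cl, int ol)
{
    return (int)((double)100.00 * ((double)(cl - ol) / (double)cl));
}

/* Count distinct strings in an array (quadratic scan; fine for a sketch). */
static int count_unique(const char **a, int n)
{
    int unique = 0;
    for (int i = 0; i < n; i++) {
        int seen = 0;
        for (int j = 0; j < i; j++)
            if (strcmp(a[i], a[j]) == 0)
                seen = 1;
        if (!seen)
            unique++;
    }
    return unique;
}
```

For two contexts {"a","b"} and {"a","b"}: cl = 4 concatenated elements, ol = 2 unique elements, giving 100*(4-2)/4 = 50.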
How Much to Implement
The biggest factor in what was implemented versus what was intended to be implemented was time. Given only 3 months, and needing to use 1 month to get through Chapter 1 of K&R, I ran into time limitations. While the program successfully analyzes and extracts morphemes, I had intended, in addition, to write an unsupervised language-blind morphosyntactic tagger and a semantic extraction mechanism. Things I would like to change in this algorithm, given more time and funding, are covered in Future Ambitions.
Research Question
How do you, non-statistically, extract morphemes from prepared corpora to develop a morpheme list with an unsupervised machine learning algorithm in the C programming language?
Work Accomplished
During this project, I successfully implemented a morpheme extraction algorithm. It assumes nothing about a supplied corpus. It currently leaks about 0.10% of the memory it allocates. The code is about 1,800 lines and consists of 21 code files, a Makefile, a LICENSE file, and a README file.
To run the program, download the code at https://github.com/zachbornheimer/morpheme-extraction. Once the code has been downloaded, you need the following tools:
1) make - https://www.gnu.org/software/make/
2) gcc - https://www.gnu.org/software/gcc/
In the directory where the software was downloaded, run make optimized for the optimized code. Further information on how to run the program is available in the README file.
Algorithm
The implemented algorithm uses the following procedure:
1) Identify and activate command-line changes
2) Extract the word-delimiter
3) Develop n-gram relationships
4) If --process-sequentially, --serial, --sequential, or --process has been passed, find morphemes
5) Repeat from step 3 until no more files
6) If --process is passed, or no argument involving processing (see 4) is passed, find morphemes
7) Write the type of each morpheme (PREFIX, SUFFIX, STEM, INFIX, or CIRCUMFIX) and the morpheme to the file specified with a command-line parameter, or to the default file
Extract the Word-Delimiter
We define a word-delimiter as a character or string of characters which contributes no semantic value other than to delineate the end of semantic value in a grapheme sequence.
For this stage, the algorithm takes an input file and develops a unique array of all characters that occur in the file. For each character in the unique array, it splits the processed file into a sequence of non-null strings and tallies the number of non-null elements that result. It then looks at the tallies generated for each word-delimiter candidate. If one character produces the highest split word count, it is returned as the word-delimiter. If more than one does, the algorithm tests permutations according to the following algorithm (represented in pseudocode):
/* PERMUTATION ALGORITHM */
while (string != testing_string) {
    for (0 to len(testing_string) as position) {
        move-to-the-right(testing_string.character-at[position]);
        increment(position);
    }
}
It tests each permutation against the file being tested to find the number of non-null strings, and it compares the result against all the permutations of all other files by storing the word-delimiter that produces the largest frequency, along with said frequency. If it finds that another word-delimiter candidate has an equally large, non-zero count, it skips that file: given that there are potentially two differing word-delimiters, either the file doesn't conform to the given word-delimiter definition or something else has gone wrong; either way, it's probably best to skip the file.
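The per-candidate tally can be sketched as follows (an assumed implementation of the counting step, not the project's actual code):

```c
#include <assert.h>

/* For a candidate delimiter character, count how many non-empty strings
 * the text splits into; the candidate maximizing this count is taken as
 * the word-delimiter. */
static int count_nonempty_splits(const char *text, char delim)
{
    int count = 0, in_token = 0;
    for (const char *p = text; *p != '\0'; p++) {
        if (*p == delim) {
            in_token = 0;            /* current token (if any) ended */
        } else if (!in_token) {
            in_token = 1;            /* a new non-empty token begins */
            count++;
        }
    }
    return count;
}
```

Splitting "no fear of men" on ' ' yields 4 non-empty strings, while splitting on 'e' yields only 3, so the space would win for that text.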
Develop N-Gram Relationships
For this mechanism, the algorithm splits the text into words and begins generating the n-gram data structure based on a constant-defined NGRAM_SIZE (always an odd number). Thanks to C's truncating integer division, NGRAM_SIZE/2 is equivalent to (NGRAM_SIZE-1)/2; whenever I refer to NGRAM_SIZE/2, I mean the value C would compute, (NGRAM_SIZE-1)/2. The algorithm sets ngram.word to the target word, and it sets NGRAM_SIZE/2 words for ngram.before and NGRAM_SIZE/2 words for ngram.after; these represent the words that occur before and after the target word (if they exist). If NGRAM_SIZE = 9 and the target is the third word in the corpus, ngram.before[0] and ngram.before[1] will be empty, but ngram.before[2] and ngram.before[3] will contain data. If the target word occurs more than once in the corpus, instead of creating a new n-gram, the algorithm adds to the existing one (so ngram.before may no longer contain a maximum of NGRAM_SIZE/2 elements, but rather N*(NGRAM_SIZE/2), where N is the number of times ngram.word occurs in the corpus).
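The record described above might be sketched like this (the field names follow the text, but the actual struct in the repository may differ), along with a check of the NGRAM_SIZE/2 identity:

```c
#include <assert.h>

#define NGRAM_SIZE 9   /* always odd, per the text */

/* Sketch of one n-gram record: a target word plus the words observed
 * before and after every occurrence of it.  before/after each hold
 * N * (NGRAM_SIZE/2) entries after N occurrences of the word. */
struct ngram {
    char  *word;         /* the target word */
    char **before;       /* preceding context words, per occurrence */
    char **after;        /* following context words, per occurrence */
    int    occurrences;  /* N: how many times word was seen */
};

/* C's integer division truncates toward zero, so NGRAM_SIZE/2 equals
 * (NGRAM_SIZE-1)/2 whenever NGRAM_SIZE is odd. */
static int half_window(void)
{
    return NGRAM_SIZE / 2;
}
```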
Lastly, the algorithm goes through each n-gram and compares its context to every subsequent n-gram whose target word doesn't occur within NGRAM_SIZE distance. This is accomplished by combining the .before and .after arrays into one array and comparing that array to another n-gram's merged array. If the elements within the arrays have a percent similarity (as defined in Computational Linguistic Goals) greater than or equal to THRESHOLD_SIMILAR_NGRAMS, the original n-gram stores the address of the similar n-gram in a linked list. Instead of a doubly or circular linked list, we use a singly linked list, because it would be redundant to store the relationship twice (ngram-x is related to ngram-y AND ngram-y is related to ngram-x); we don't care about the order of a relationship, only that a relationship exists. As implemented, each n-gram looks for similarity starting with ngram[n+1]. For example, if ngram[0] and ngram[5] are similar, the connection will be stored in ngram[0]. However, when it comes time for ngram[5] to store the addresses of similar n-grams, it will start with ngram[6] rather than ngram[0]. This decision originates from the connection between ngram[0] and ngram[5] already being stored: no resulting data would change if ngram[5] also stored the address of ngram[0], and in fact the cycle of memory addresses could become an infinite regression during processing, which would require additional precautions. Thus, for simplicity's sake, the algorithm only looks at subsequent n-grams.
Find Morphemes
The idea of finding morphemes is fairly simple and revolves around a code sequence called find_longest_match, which takes string0 and string1 and tries to find the longest common contiguous character string starting from position 0 of string0 and string1. The program executes this on two target words with their characters non-reversed and reversed (to find immediate prefixes and suffixes, respectively) and stores this data. It then runs on the internals of the words by removing the first character of string1 and looking for a common contiguous string with a length greater than or equal to 2 characters. It stores this data after making sure that the found morpheme isn't a proper subset of a prior morpheme (like { morpheme0: subset, morpheme1: ubset }). It proceeds until string1 == NULL. If string1 == NULL and string0 != NULL, it removes the first character of string0, resets string1, and repeats removing characters from string1 until string0 == string1 == NULL.
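A minimal version of find_longest_match, assuming it returns the length of the match from position 0 (the real function may instead return the matched substring):

```c
#include <assert.h>

/* Length of the longest common contiguous run starting at position 0 of
 * both strings.  Run on the reversed words, the same routine finds
 * shared suffixes instead of shared prefixes. */
static int find_longest_match(const char *s0, const char *s1)
{
    int n = 0;
    while (s0[n] != '\0' && s1[n] != '\0' && s0[n] == s1[n])
        n++;
    return n;
}
```

For example, "walking" and "walked" share "walk" (length 4), and reversing "jumped" and "walked" ("depmuj" vs. "deklaw") shares "de", i.e. the suffix "ed".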
The program runs this algorithm for each combination of target words such that it follows the permutation algorithm as defined in the description of the word-delimiter extraction.
It removes duplicate morphemes by fusion. It fuses duplicate morphemes by generating regular expressions based on the words in which the morpheme originated and combining them character by character, creating classes within two sets of parentheses (representing arrays for pre-morpheme and post-morpheme context). From there, it reconstructs the full regex from its reconstruction data (the raw regex stored by character position), which is analyzed for the type of morpheme.
Morphemes are tagged as stems if they occur at any point standing alone, or if they are marked as either a prefix or suffix and then occur as a suffix or prefix (respectively). Prefixes occur at the beginning of words, suffixes at the end, and infixes occur if they occur naturally and account for THRESHOLD_CONFIRMATION percent of the morpheme candidates. Prefixes and suffixes are automatically tagged as such if they meet the threshold and occur in the correct position. Circumfixes are tagged by looking at the current morpheme's type: if it is a suffix, the algorithm looks through all the morphemes for a prefix (if it is a prefix, for a suffix) that appears with the same frequency and whose list of associated words has a percent agreement greater than or equal to THRESHOLD_CIRCUMFIX, while both morphemes also meet the confirmation threshold. Anything marked UNDEF that meets the confirmation threshold is marked as an infix. All morphemes are tagged UNDEF until this tagging process.
Data Observation
For morpheme tagging, the priority list is:
1) STEM
2) INFIX
3) CIRCUMFIX
4) PREFIX & SUFFIX
5) UNDEF
where a lower value has higher priority.
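The priority rule can be expressed as a small comparison (the enum values here are hypothetical; the project's actual constants may differ):

```c
#include <assert.h>

/* Lower value = higher priority; PREFIX and SUFFIX share one level. */
enum morph_type { STEM = 1, INFIX, CIRCUMFIX, PREFIX_SUFFIX, UNDEF };

/* A tag is only ever replaced by a higher-priority one, so a morpheme
 * tagged STEM can never be demoted. */
static enum morph_type retag(enum morph_type current, enum morph_type proposed)
{
    return (proposed < current) ? proposed : current;
}
```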
Once something is tagged as STEM, it can't be taken down from that level. That being said, any other type can be re-tagged as STEM if the morpheme appears unbounded. This is done as a measure of accuracy, and its effect can be seen in the data supplied (see Data Sample).
Data Sample
The following is a data sample gleaned from running on the test corpus supplied in the git repository under test-corpus; specifically, it is the first chapter of H. G. Wells's War of the Worlds.
The results took 5.118s of real time (as measured by the time program included with Mac OS X 10.9) and used 274.2 MB of RAM running the version created with make optimized serially: ./nlp --serial. Additionally, for the sake of ease, I ran sort on the output to sort it alphabetically for display:
./test-corpus/War_of_the_Worlds.txt
===================
INFIX: --
INFIX: ace
INFIX: ack
INFIX: ad
INFIX: ain
INFIX: al
INFIX: am
INFIX: an
INFIX: ance
INFIX: anet
INFIX: ar
INFIX: arer
INFIX: ars
INFIX: at
INFIX: ber
INFIX: bit
INFIX: ble
INFIX: ca
INFIX: ce
INFIX: cessar
INFIX: credibl
INFIX: ct
INFIX: de
INFIX: der
INFIX: dnight
INFIX: ea
INFIX: ead
INFIX: ec
INFIX: ects
INFIX: ed
INFIX: edibl
INFIX: een
INFIX: ef
INFIX: el
INFIX: eld
INFIX: eling
INFIX: em
INFIX: ember
INFIX: en
INFIX: ence
INFIX: ent
INFIX: er
INFIX: ern
INFIX: es
INFIX: ess
INFIX: et
INFIX: ey
INFIX: fe
INFIX: gard
INFIX: gh
INFIX: ght
INFIX: ha
INFIX: habit
INFIX: haw
INFIX: ht
INFIX: ib
INFIX: ibl
INFIX: igen
INFIX: igence
INFIX: ight
INFIX: il
INFIX: ile
INFIX: iles
INFIX: ill
INFIX: ilvy
INFIX: im
INFIX: ing
INFIX: int
INFIX: inted
INFIX: ion
INFIX: ir
INFIX: ist
INFIX: istence
INFIX: ite
INFIX: la
INFIX: lai
INFIX: ld
INFIX: le
INFIX: les
INFIX: lescop
INFIX: lescope
INFIX: lie
INFIX: ll
INFIX: llects
INFIX: lligen
INFIX: lligence
INFIX: llion
INFIX: lu
INFIX: me
INFIX: member
INFIX: mens
INFIX: mote
INFIX: mpla
INFIX: nc
INFIX: nce
INFIX: ncern
INFIX: nces
INFIX: nd
INFIX: ne
INFIX: ness
INFIX: nger
INFIX: ns
INFIX: nt
INFIX: ntur
INFIX: ntury
INFIX: ob
INFIX: ol
INFIX: om
INFIX: on
INFIX: onom
INFIX: op
INFIX: ope
INFIX: ople
INFIX: ou
INFIX: oud
INFIX: ough
INFIX: out
INFIX: ow
INFIX: owe
INFIX: pe
INFIX: pear
INFIX: per
INFIX: pers
INFIX: pl
INFIX: plain
INFIX: pp
INFIX: pul
INFIX: rd
INFIX: re
INFIX: rea
INFIX: retch
INFIX: rf
INFIX: rface
INFIX: rge
INFIX: ri
INFIX: rkness
INFIX: rld
INFIX: rm
INFIX: rn
INFIX: ro
INFIX: rs
INFIX: rshaw
INFIX: rt
INFIX: rth
INFIX: ru
INFIX: rutin
INFIX: rv
INFIX: scop
INFIX: scope
INFIX: se
INFIX: serv
INFIX: sh
INFIX: si
INFIX: ss
INFIX: ssar
INFIX: st
INFIX: stan
INFIX: stance
INFIX: tch
INFIX: te
INFIX: ted
INFIX: tel
INFIX: tell
INFIX: tellects
INFIX: telligen
INFIX: telligence
INFIX: tence
INFIX: ter
INFIX: tershaw
INFIX: th
INFIX: ti
INFIX: tin
INFIX: tion
INFIX: tronom
INFIX: ts
INFIX: ul
INFIX: und
INFIX: ur
INFIX: ury
INFIX: us
INFIX: use
INFIX: vy
INFIX: wer
PREFIX: "
PREFIX: 1
PREFIX: A
PREFIX: C
PREFIX: D
PREFIX: E
PREFIX: F
PREFIX: H
PREFIX: M
PREFIX: Mar
PREFIX: N
PREFIX: O
PREFIX: Pe
PREFIX: S
PREFIX: T
PREFIX: Th
PREFIX: Thi
PREFIX: _
PREFIX: a
PREFIX: ab
PREFIX: ac
PREFIX: af
PREFIX: ap
PREFIX: app
PREFIX: appear
PREFIX: astronom
PREFIX: b
PREFIX: ba
PREFIX: beg
PREFIX: bel
PREFIX: belie
PREFIX: bi
PREFIX: bl
PREFIX: bla
PREFIX: br
PREFIX: bri
PREFIX: bu
PREFIX: c
PREFIX: cal
PREFIX: centur
PREFIX: ch
PREFIX: cl
PREFIX: clo
PREFIX: cloud
PREFIX: co
PREFIX: compla
PREFIX: con
PREFIX: concern
PREFIX: cr
PREFIX: d
PREFIX: da
PREFIX: danger
PREFIX: di
PREFIX: dis
PREFIX: dist
PREFIX: distan
PREFIX: dr
PREFIX: du
PREFIX: e
PREFIX: ear
PREFIX: emp
PREFIX: ev
PREFIX: eve
PREFIX: ex
PREFIX: exc
PREFIX: exce
PREFIX: exp
PREFIX: expl
PREFIX: explain
PREFIX: eyes
PREFIX: f
PREFIX: fa
PREFIX: fai
PREFIX: fee
PREFIX: fi
PREFIX: field
PREFIX: fir
PREFIX: fl
PREFIX: fla
PREFIX: flam
PREFIX: fo
PREFIX: fr
PREFIX: g
PREFIX: ga
PREFIX: gr
PREFIX: gre
PREFIX: gu
PREFIX: gun
PREFIX: h
PREFIX: happ
PREFIX: hav
PREFIX: hea
PREFIX: her
PREFIX: hi
PREFIX: ho
PREFIX: hou
PREFIX: house
PREFIX: hu
PREFIX: i
PREFIX: illu
PREFIX: imm
PREFIX: imme
PREFIX: immens
PREFIX: inc
PREFIX: incredibl
PREFIX: inte
PREFIX: intell
PREFIX: intelligen
PREFIX: j
PREFIX: ju
PREFIX: k
PREFIX: l
PREFIX: large
PREFIX: li
PREFIX: lo
PREFIX: m
PREFIX: ma
PREFIX: mat
PREFIX: men
PREFIX: mi
PREFIX: mid
PREFIX: mil
PREFIX: min
PREFIX: mo
PREFIX: mor
PREFIX: mu
PREFIX: mus
PREFIX: n
PREFIX: necessar
PREFIX: ni
PREFIX: o
PREFIX: obs
PREFIX: observ
PREFIX: p
PREFIX: pa
PREFIX: pla
PREFIX: plan
PREFIX: po
PREFIX: point
PREFIX: popul
PREFIX: power
PREFIX: pr
PREFIX: pro
PREFIX: prob
PREFIX: r
PREFIX: ra
PREFIX: read
PREFIX: real
PREFIX: rec
PREFIX: rece
PREFIX: rem
PREFIX: remote
PREFIX: rush
PREFIX: s
PREFIX: sa
PREFIX: sc
PREFIX: scrutin
PREFIX: sec
PREFIX: see
PREFIX: ser
PREFIX: sho
PREFIX: sl
PREFIX: sli
PREFIX: slight
PREFIX: sm
PREFIX: so
PREFIX: sou
PREFIX: sp
PREFIX: spec
PREFIX: spi
PREFIX: sta
PREFIX: stead
PREFIX: str
PREFIX: stre
PREFIX: strea
PREFIX: stretch
PREFIX: stru
PREFIX: su
PREFIX: sup
PREFIX: sur
PREFIX: sw
PREFIX: swi
PREFIX: swim
PREFIX: t
PREFIX: ta
PREFIX: telescop
PREFIX: tha
PREFIX: thi
PREFIX: thir
PREFIX: tho
PREFIX: thou
PREFIX: thr
PREFIX: thre
PREFIX: thro
PREFIX: tr
PREFIX: tra
PREFIX: tw
PREFIX: twe
PREFIX: twel
PREFIX: twent
PREFIX: u
PREFIX: un
PREFIX: v
PREFIX: va
PREFIX: ve
PREFIX: vi
PREFIX: vo
PREFIX: vol
PREFIX: w
PREFIX: wa
PREFIX: water
PREFIX: we
PREFIX: wh
PREFIX: whi
PREFIX: wi
PREFIX: win
PREFIX: wis
PREFIX: wo
PREFIX: ye
STEM: For
STEM: I
STEM: It
STEM: Mars
STEM: Ogilvy
STEM: Ottershaw
STEM: The
STEM: about
STEM: after
STEM: all
STEM: are
STEM: as
STEM: be
STEM: black
STEM: bright
STEM: by
STEM: century
STEM: cool
STEM: darkness
STEM: day
STEM: distance
STEM: do
STEM: dust
STEM: earth
STEM: ever
STEM: existence
STEM: eye
STEM: faint
STEM: feeling
STEM: for
STEM: fro
STEM: gas
STEM: green
STEM: grey
STEM: he
STEM: hour
STEM: idea
STEM: in
STEM: inhabit
STEM: intellects
STEM: intelligence
STEM: is
STEM: it
STEM: last
STEM: life
STEM: light
STEM: lit
STEM: man
STEM: mean
STEM: midnight
STEM: miles
STEM: million
STEM: my
STEM: nearer
STEM: new
STEM: night
STEM: no
STEM: of
STEM: one
STEM: or
STEM: our
STEM: ours
STEM: paper
STEM: papers
STEM: part
STEM: people
STEM: planet
STEM: pointed
STEM: red
STEM: regard
STEM: remember
STEM: round
STEM: same
STEM: seem
STEM: since
STEM: slow
STEM: small
STEM: space
STEM: star
STEM: still
STEM: sun
STEM: surface
STEM: telescope
STEM: that
STEM: the
STEM: them
STEM: to
STEM: under
STEM: up
STEM: war
STEM: warm
STEM: was
STEM: world
STEM: years
SUFFIX: ated
SUFFIX: e,
SUFFIX: ed,
SUFFIX: er,
SUFFIX: ly
SUFFIX: ous
SUFFIX: ves
What's most interesting about this data is that it doesn't seem all that off in some sections and is wildly off in others. The suffixes seem to be about 60% correct insofar as the algorithm discerned very familiar suffixes. The stems are 100% accurate. The prefixes and infixes are wildly off. This is in part due to the absence of meaning. Once the software tries to assign meaning to each morpheme, that will facilitate eradication of false positives.
Future Ambitions
For the future, it would make sense to rewrite parts of this algorithm to make use of a GPU through CUDA or OpenCL. This would allow most of the operations involved in determining n-gram parallelism to happen simultaneously as opposed to serially. The savings would be enormous: in a 100,000-word corpus, there are 100,000! n-grams of size NGRAM_SIZE processed in total. Each of those n-grams could be processed simultaneously instead of sequentially, so it could, theoretically, take approximately the time for 1 n-gram of size NGRAM_SIZE to be processed in the time it would otherwise take ~2.82 × 10^456,573 n-grams to be processed (if there were no limits on GPU cores).
Another improvement would be to make use of, or create, a job scheduler which would allow this program to run on multiple computers simultaneously. The ideal host for this would be Amazon's EC2 service, as it offers GPU processing in a High Performance Cluster of machines. Running a final program on this type of machinery would yield terrific results.
Finally, the most important aspect of my future ambitions would be to create a morphosyntactic tagger and gloss generator (semantics extractor) using a corpus and nothing else (semantics require semantic seed data, as language and meaning were not associated in a bubble, but rather are the result of associations in the real world). These two tools are the next logical steps after this program, as their code would function in a manner parallel to the morpheme extraction algorithm.

References
Chomsky, N. (1956). Three models for the description of language. IRE Transactions on Information Theory, 2(3), 113-124.
Kernighan, B. W., & Ritchie, D. M. (1988). The C programming language (2nd ed.). Englewood Cliffs, NJ: Prentice Hall.
Kroah-Hartman, G. (2002, June 26). Kernel CodingStyle. The Linux Kernel Archives. Retrieved December 12, 2013, from https://www.kernel.org/doc/Documentation/CodingStyle