Towards The Sense Disambiguation of Afan Oromo Words Using Hybrid Approach (Unsupervised Machine Learning and Rule Based)

Towards the Sense Disambiguation Workineh T., Debela T. & Teferi K.
61
ORIGINAL ARTICLE
Towards the Sense Disambiguation of Afan Oromo Words

Using Hybrid Approach(Unsupervised Machine Learning
and Rule Based)
Workineh Tesema1, Debela Tesfaye1 and Teferi Kibebew1
Abstract
This study was conducted to investigate Afan Oromo Word Sense Disambiguation which
is a technique in the field of Natural Language Processing where the main task is to find
the appropriate sense in which ambiguous word occurs in a particular context. A word
may have multiple senses and the problem is to find out which particular sense is
appropriate in a given context. Hence, this study presents a Word Sense Disambiguation
strategy which combines an unsupervised approach that exploits sense in a corpus and
manually crafted rule. The idea behind the approach is to overcome a bottleneck of
training data. In this study, the context of a given word is captured using term co-
occurrences within a defined window size of words. The similar contexts of a given
senses of ambiguous word are clustered using hierarchical and partitional clustering.
Each cluster representing a unique sense. Some ambiguous words have two senses to the
five senses. The optimal window sizes for extracting semantic contexts is window 1 and 2
words to the right and left of the ambiguous word. The result argued that WSD yields an
accuracy of 56.2% in Unsupervised Machine learning and 65.5% in Hybrid Approach.
Based on this, the integration of deep linguistic knowledge with machine learning
improves disambiguation accuracy. The achieved result was encouraging; despite it is
less resource requirement. Yet; further experiments using different approaches that
extend this work are needed for a better performance.
Keywords: Afan Oromo, Ambiguous Word, Hybrid, Rule Based, Word Sense
Disambiguation.
_________________________________________________________________________
1
Jimma University, College of Natural Sciences, Department of Information Science,
Jimma, Ethiopia
Ethiop. J. Educ. & Sc. Vol. 12 No 1, September 2016 62
INTRODUCTION we have used the context information

In todays world, where World Wide Web derived from the corpus and ruled based
technology is keeping on growing very approaches. One of the good news for the
fast, many users go to the web to search for Afan Oromo disambiguation is that the
information, for entertainment, to read developed system was open to take any
documents and electronic books. ambiguity words provided by users unlike
Sometimes it is observed that the result of a other systems which are limited to a few
search is not appropriate. The reason words. It was integrated deep linguistic
behind is, there is an ambiguous word in knowledge with algorithms markedly
the query (Salton, 2015). Information improves disambiguation accuracy. Also
Retrieval (IR) can potentially benefit from argue that to find senses in Afan Oromo,
the correct senses of words provided by this work was significantly increased word
Word Sense Disambiguation (WSD). sense disambiguation performance and
Ambiguity is a cause of poor performance applicable for different applications like
in IR systems. The queries may contain machine translation. This sense
ambiguous words (terms), which have disambiguation has been used for many
multiple senses (Ide and Jean, 2010). NLP applications and pursued as a way to
Therefore, the objective of word sense improve retrieval systems, and generally
disambiguation (WSD) is to identify the get better information access. The
correct sense of a word in context. It is one motivation behind this study was to allow
of the most critical tasks in most natural the users to make ample use of the
language processing (NLP) applications, available technologies.
including information retrieval, information In Afan Oromo identifying the correct
extraction, and machine translation. senses of the ambiguous words is easy for
Natural languages have ambiguous words human being, basically, sometimes it is
which need to be disambiguated and thus difficult. However, it is too tough for the
the appropriate sense of an ambiguous machine to identify the correct sense of
word in a given context can be identified. these words. Nowadays, as the
Each word may have more than one sense development of technology is increasing
that is why a single word sometimes can rapidly, like Afan Oromo has also started to
have many senses (Sreedhar et al., 2012). use the technology for different purposes
An ambiguous word can take several (Nancy and Jean, 2007). Like other natural
senses depending on the context in which it languages in Afan Oromo there exists same
appears. The same form and pronunciation form of words, which has more than one
can take different senses in different different meaning. The challenge is to find
contexts (David and Radu, 2002). out which particular sense is appropriate in
This work assigns the sense and a given context. Sense disambiguation is
accomplished by using major sources of the problem of determining which sense of
information contained within the context a word is active in a particular context
(Adam, 2007). All disambiguation work for (Agirre and Martinez, 2000). Additionally,
Afan Oromo was used only machine the machines have no ability to decide such
learning either supervised or unsupervised an ambiguous situation unless some
methods. Still a few researches on sense practices have been planted into the
disambiguation for this language are focus machines memory (Shaikh Samiulla,
only on corpus information which is less 2013). Here the bag of words is the context
accurate and limited to disambiguate few that considered as words in some window
ambiguity words. However, in this study, surrounding the ambiguous word in terms
Towards the Sense Disambiguation Workineh T., Debela T. & Teferi K. 63
of a distance it has from ambiguity. The contexts give us the clue of information
surrounding word of the ambiguous word what the sense of this ambiguity word is. It
decides correct choice of word. Assume, seems that this ambiguity word identifies
the ambiguity word afaan has surrounded its senses with the help of its contexts as
with the following contexts. So, these shown the in Figure 1 below:
Figure 1. Contexts Identification
The work done by (Kebede, 2013) on Word which its performance was not much
Sense Disambiguation for Afaan Oromo surprise unlike hybrid approaches.
language is based on an annotated corpus.
However, the performance of the system However, in our case the system was open
was limited to only five ambiguities words to disambiguate any ambiguity word given
(such as sanyii, karaa, horii, sirna and by users rather than limited to few
qoqhii). Hence, he manually annotates the ambiguity words. Since, the developed
corpus, the approach was exposed to a corpus was large in size it is suitable to
bottleneck of training data because of there capture the surrounding N contexts to left
is no standardized labeled dataset for this and right unlike the supervised approach.
language. Additionally, his work was only Additionally, different algorithms and
relied on the machine learning approach developed hybrid (rules + machine
learning) are used because it overcomes the
limitation of algorithms by rule based an improvement in assigning correctly the

(linguistic knowledge). This makes corresponding disambiguation over the
integration of different algorithms and rules baseline method (Francisco et al., 2005).
which was suitable for Afan Oromo which
has lack of tools and resources. Therefore, An unsupervised machine learning
unlike the study conducted by Tesfa, our approach for Amharic using five selected
study focused on hybrid approaches which algorithms was used; these are Simple k-
is unsupervised machine learning and ruled means, EM (Expect Maximization) and
based to overcome the limitation of agglomerative single, average and
machine learning. complete link clustering algorithms. The
The unsupervised machine learning for tested unsupervised machine learning
sense disambiguation was trained on method that deals with clustering of
unannotated corpus, rivals that the contexts for a given word that express the
performance of supervised techniques same sense. The work concluded that
which requires time-consuming hand simple k means and EM clustering
annotations (Yarowsky, 2007). So, algorithms achieve higher accuracy on the
unsupervised Word Sense Disambiguation task of WSD for selected ambiguous word,
method is based on unlabeled corpora, and provided with balanced sense distribution
does not exploit any manually sense-tagged in the corpus (Wassie, 2014).
corpus to provide a sense choice for a word The sense of the words is extracted based
in context unlike supervised method on live contexts using supervised and
(Roberto, 2009). It has the potential to unsupervised approaches. Unsupervised
overcome the knowledge acquisition approaches use online dictionary for
bottleneck (Yarowsky, 2007), that is, the learning, and supervised approaches use
lack of large-scale resources manually manual learning sets. Hand tagged data are
annotated with word senses. This approach populated which might not be effective and
to WSD is based on the idea that the same sufficient for learning procedure. This
sense of a word has similar neighboring limitation of information is the main flaw
words. Its able to induce word senses from of the supervised approach. The developed
input text by clustering word occurrences, approach focuses to overcome the
and then classifying new occurrences into limitation using a learning set which is
the induced clusters. It does not rely on enriched in a dynamic way of maintaining
labeled training text and, in their purest new data. The trivial filtering method is
version; do not make use of any machine- utilized to achieve appropriate training
readable resources like dictionaries, data. The mixed methodology having rule
thesauri and ontology. based approach and Bag-of-Words having
The information about the ambiguous word enriched bags using learning methods. The
to be disambiguated, words that are approach establishes the superiority over
syntactically related, and words that are individual rule based and Bag-of-Words
topically related to the ambiguous word approaches based on experimentation
(Jurafsky and Martin, 2009). The limited (Ranjan Pal et al., 2013).
availability of resources in the form of Rule based approach exploit the hand craft
digital corpora and annotated, the rule rule for WSD task. The rule based require
based method is applied. All senses are extensive work of expert linguists and thus
discovered using a set of rules and can result in near human accuracy. The
knowledge base for later use in the (Tesfaye, 2010) Afan Oromo rule based
disambiguation process. The hybrid shows Afan Oromo Grammar Checker, showed a
promising result. The results show that rule
based is an approach used in the combination with the manually crafted

morphologically rich language like Afan rules for clustering similar contexts of
Oromo. This rule based approach for ambiguous word and extracting the
languages, such as Afan Oromo, advanced contexts respectively. The motivation
tools has been lacking and are still in the behind the use of hybrid is mainly aroused
early stages. In this work, (Stefan et al., form the fundamental problem of corpus-
2011) a model that represents word sense based approach in relation to the sparseness
in context by vectors which are modified of the training contexts. The idea of this
according to the words in the targets research is therefore to combine both the
syntactic context (Adam, 2007). rule based and unsupervised machine
learning approaches into a hybrid approach.
Such method of word sense disambiguation
OVERVIEW OF AFAN OROMO as also employed in this work, combine the
Afan Oromo, also called Oromiffaa or advantages from machine learning and rule
Afaan Oromoo, is a member of based, potentially yielding better results.
the Cushitic branch of the Afro-Asiatic Hence, the reason why the hybrid approach
language family (Gragg and Gene, 2006). It was used is taking the availability and
is the third most widely spoken language in reliability of linguistic knowledge on the
Africa, after Hausa and Arabic. Its original top of the semantic techniques and training
homeland is an area that includes much of methods learned from a corpus for learning
what is today Ethiopia and some parts of the role of the words in its context.
other East African countries like Hence, there is no annotated corpus for
northern Kenya, Somalia, Tanzania and Afan Oromo the study was limited to use
Sudan (Abera, 2001). Currently, it is an free developed corpus (Unsupervised
official language of Oromia Regional State Method, which is suitable when there is
(which is the biggest region among the scarcity of training data). However,
current Federal States in Ethiopia). It is Unsupervised Method was enough than
used by Oromo people, who are the largest other methods when there is a small
ethnic group in Ethiopia, which constitute training dataset. Contrary, the hybrid
about 50% according to the estimate of method was more confortable than
2007 (Gragg and Gene, 2006). With regard Unsupervised Method due to it combines a
to the writing system, Qubee (a Latin-based set of rules with machine learning.
alphabet) has been adopted and become the Therefore, the integration of both methods
official script of Afan Oromo from 1991 was overcome the problems of each other
(Guya, 2003). As this language has more and improve the performance of the system
than half of the countrys population, there For implementation of the study, we have
are no standard dataset and natural used Java; Net Beans 8.0.2 which runs on
language tool. This study was one of the the prepared corpus and the clustering were
contributors for this under resourced performed in Weka 3.6.5 tool (compatible
language. with Java package). This package is a
general-purpose and open source
programming language. Moreover, it is
METHODS AND MATERIALS optimized for program portability and
This section describes the methods and component integration. This makes more
tools employed in this study. The study prefer for the study than other software
relies on the patterns learned from the packages.
corpus (unsupervised approach) in
Dataset SENSE DISAMBIGUATION

For this study, new corpus prepared for the ALGORITHMS
Sense Disambiguation of Afan Oromo. The developed disambiguation model for
However, in such a work, it is very difficult Afan Oromo involve followed three step
to obtain standardized dataset for under process:
resourced language like Afan Oromo. The
procedure for collecting and preprocessing a. Text preprocessing which takes
corpus is described here below: input and corpus, tokenize to
remove stop words and perform
normalization.
Corpus Preprocessing and Acquisitions b. Extract context terms providing
The corpus acquisition and preparation are clues about the senses of the
set of techniques required for gathering and ambiguous term using two
compiling data for training and testing techniques (window
algorithms. The lack of resource has led to size and rule based).
use of unannotated raw corpus to perform
hybrid disambiguation. It should be noted c. Clustering to group similar
that unsupervised disambiguation cannot context terms of the given
actually label specific terms as a referring ambiguous terms, the number of
to a specific concept that would require clusters representing the number
more information than is available. of senses encoded by the
The mechanism to acquire resource is to ambiguous term. In order to
use the data from various sources and cluster similar context terms the
hence it has not previously used in any degree of similarity computed
research on Afan Oromo. The collected using the vectors constructed from
data are machine readable free text. It is co-occurrence information.
collected from newspapers (Bariisaa,
Kallacha Oromiyaa and Oromiyaa. Bariisaa As described in the above, one of the
is a weekly newspaper, whereas the rest method used is unsupervised machine
two come out once in two weeks), learning:
bulletins, news (ORTO), government and
official websites. Moreover, Oromia Radio i. In the unsupervised machine learning,
and Television Organization (ORTO) the surrounding contexts were
found in Adama releases daily news extracted by sliding a window of n
through radio and television broadcast and words. The words that occur in similar
on its official website (Debele, 2014). To contexts tend to have similar senses. In
reduce the data sparse, the data used from order to extract the contexts from a set
these sources since they are believed to of sentences the role of the window is
represent texts addressing various issues of great. The senses and contexts can be
the language. Actually, the collected data captured in terms of the frequency co-
were not directly used for the purpose. occurrence neighborhood, i.e. words
co-occurring.
Context is the only means to identify the cluster representing the number of senses
sense of an ambiguous word. The context assumed by the ambiguous word. The
window size defines the size of the window underlying idea of the clustering of word
of context. A window size of N means that contexts provides a useful way to discover
there will be a total of N words in the semantically related senses.
context window. In order to disambiguate a
given word, a small and wider context For each context extracted, vector space
should be considered in the performance of matrix constructed from co-occurrences.
the system to rise overall. After the co-occurrence matrix, the cosine
similarity was computed based on the angle
Once the context words are extracted, the between vectors of the contexts. These
next step cluster similar contexts based on cosine similarity values were used to
their inherent semantics, the number of the cluster similar contexts.
The context terms of the ambiguous words The hybrid approach constitutes the
cluster using their similarity values unsupervised approach to cluster the
produced. The clustering algorithms used contexts followed by hand crafted
in this study are hierarchical agglomerative rule to extract the modifiers of the
clustering, which include single link, ambiguous word. The modifiers
complete link, average link and EM and K- have a great role to decide on the
means clustering from partitional word sense according to its role in
clustering. the sentence. The modifiers can
appear before the target word (the
ii. The other method used in this study word, it modifies or describe).
was the hybrid machine learning
approach. In this hybrid approach, In Afan Oromo, the words preceding a
the machine learns by the help of specific word are more likely to influence
manually developed rule approach. the sense of a word.
For example, [Mucayyoon ija akka boqqoolloo qabdi].
Disambiguation is done by analyzing the If ambiguous word preceded by

linguistic features of the word and its modifiers, then collect the
preceding word. The rule-based section of modifiers to disambiguate.
this approach disambiguates word If ambiguous word is a noun, the
automatically using rules in order to modifiers immediately following
complement the features learned from ambiguous Word.
training data. This information is coded in If the ambiguous word was a verb,
the modifiers immediately
the form of rules. Based on this notion, the preceding ambiguous word.
rule was developed was as follows:
Figure 2. Architecture of the System
IMPLEMENTATIONS corpus is prepared for the purpose of this

In this work, two types of data were used: study as there is no standard corpus for
(a) the small list of ambiguous words to test Afan Oromo language. However, it is
the algorithms. Fifteen (15) highly frequent unlabeled data and never used in any
ambiguous words were selected from the researcher and it needs to be developed
language speakers using the questionnaire, further.
(b) the big corpus containing thousands of The corpus, which is a set of sentences first
sentences to extract contexts and its vectors tokenized into words. Since, Afan Oromo
to represent the group of contexts. This is Latin alphabet the sentences can split
using similar word boundary detection them into lowercase. To disambiguate the
techniques like white space. After given ambiguous word, it takes one
tokenization takes place, we have removed ambiguous word at a time in the interface
Afan Oromo stop words (non-content provided and produces the cosine similarity
bearing words); hence it has no effect on to cluster. Then for clustering purpose, the
the sense of the words. Finally, some Weka tool is used to cluster the cosine
characters of the same words are similarity planted to it. The following was
sometimes represented in uppercase or the snapshot of the Word Sense
lowercase in the corpus as well as in the Disambiguation system interface:
user input and hence we have normalized
Figure 3. User Interface of the WSD
words. In order to measure the performance

RESULTS
of the window size for the disambiguation,
The system provides context terms to the
we run the disambiguator using all the
left and right side of ambiguous word
window sizes (starting from 1 to 10) and
(excluding stop words) based on the
observe the differences in the performance
provided window size. In this investigation,
of the disambiguator.
window size of up to 10 words on both
As evidenced by the experiment,
sides of ambiguous word has been tried for
differences in the window sizes yield
Afan Oromo. Starting with the first term in
different results. The result has proved that
the test set, extract words appearing N-
window of two-words to the left and right
words left and right of the ambiguous
of the ambiguous word achieved the best (Jurafsky and Martin, 2009). Table 1 below
performance than other windows. This shows the performance of the disambiguate
result converges with the result obtained by using different window sizes.
Table 1. Determining Optimal Window Size
Window Size (N) Accuracy (%)

1 73.34
2 66.67
3 60.0
4 55.5
5 51.6
6 46.3
7 41.1
8 36.0
9 31.8
10 28.6
As can be seen from Table 1.1, a narrow It is very likely that smaller window sizes
window of context, one and two words to have yielded significantly higher accuracy
either side (73.34% and 66.67% than other windows and different windows
respectively), was found to perform better gave different results. From this
than wider windows (28.6%). The accuracy experiment, we conclude that smaller
is conducted by measuring the performance window sizes usually lead to accuracy
of WSD with varying the window size (the while bigger window sizes relatively low
tested ambiguous word found in sentences). accuracy (Debele, 2014).
Table 2. Unsupervised and Hybrid WSD Results
Clustering Accuracy (%)

Algorithms Unsupervised Machine Learning Hybrid Approach
Single Link 54.4% 61%
Complete Link 54.4% 59.7%
Average Link 54.4% 61%
K-Means 56.9% 71.2%
EM 60.7% 74.6%
compared with results discussed so far. It is
From the finding of the above table, the especially interesting that using the
addition of deep linguistic knowledge to a preceding modifiers of the ambiguous word
WSD system is a significant rise in perform better result. The modifiers contain
disambiguation accuracy and coverage as
a lot of valuable clues for disambiguation contexts gaara and tabba are grouped to
(Gamta, 2005). make the sense of highland and cluster 2 is
daara and uccuu are merged to make the
One thing that is clear from the experiment sense of cloth, the other two clusters:
is that the senses are clustered where cluster 3 and cluster 4 are incorrectly
clustered and cannot make a sense. The
cluster 0 is the pair of contexts which are figure 4 below Dendrogram shows the
bilisa and qabsoo clustered and make the more description of these results:
sense of got freedom, cluster 1 is the pair of
Figure 4. Dendrogram of the Senses Clustered for Bahe
Evaluation of Sense Disambiguation was undertaken on the basis of precision

System and recall. Precision is defined as the
Hence, there were no previous standard percentage of correctly disambiguated
Afan Oromo word sense disambiguation words out the total of disambiguated
dataset for evaluation as presented in words. Recall is defined as the percentage
corpus preparation Section. For this reason, of correctly disambiguated words out of the
the system did not evaluate against the total number of ambiguous words
other systems. In this work, the evaluation (Flickinger, 2015).
# Correctly Disambiguated Words

Precision (%) = __________________________
# Disambiguated Words
# Correctly Disambiguated Words

Recall (%) = ___________________________
# Total number of Ambiguous Words
are dealing with Vector Space Model

In addition to Precision-Recall measures, (cosine similarity) and clustering
the researchers have used extra evaluation, (Euclidean distance) (Kaplan, 2015). By
particularly in Hybrid method and using cosine similarity we include
Clustering as it clearly shown in the figure important semantic information in the
purely statistical process of selecting the
2 system architecture. The reason why this appropriate sense of a particular word. This
evaluation mechanism needed is that on the benefits both unsupervised, hybrid
rule based to identify whether the modifiers approaches to WSD by increasing the
preceding ambiguous words are either chances of matching a particular context.
Nouns or Verbs. And also to evaluate how The result found that using a window size
much the produced clusters are comply of 2 words either side of the target word
with the clusters prepared by human offered the accuracy of disambiguation
experts as a benchmark. The researchers than using the whole sentence. Therefore;
achieved, how many of the clustered smaller values of the window size, which
contexts are correct, i.e. to evaluate if all leads to the proper choice of sense of the
the similar contexts of the ambiguous target word. Based on this result, we
words are placed in the same group. conclude that for Afan Oromo window 2
As the result shows in Table 2, the actual was recommended unlike Amharic
result of the study was 85.5%, which is language (Solomon, 2011) which window
encouraging with the under resourced size of 3 is recommended.
language that of Afan Oromo language. As shown in table 2, the result obtained by
The overall performance of the system on unsupervised machine learning and hybrid
the machine learning and rule based yields approach was different as the semantic
that better achievement than before. An information extracted by the algorithms is
important point here is the study decide distinct from the rule. However, the
good clustering, since it is commonly inclusion of unsupervised machine learning
acknowledged that there is best criterion of only as features does not always improve
the final aim of the clustering. performance. This is (Kaplan, 2015) that
machine learning algorithms were a useful
information source for disambiguation but
DISCUSSION that its not as robust as a linguistic
The conducted experiment shows that, the (modifiers in this case). The most likely
semantic has come to the conclusion that reason for this is that our approach relies on
the sense of words are closely connected to automatically assigned immediately
the statistics of word usage, which are preceding words while machine learning
working with window size and vector value are needs to left and right of unannotated
derived from event frequencies; that is, we data set. On the other hand, the machine
learning is noisy while the rule is more The WSD developed for Afan Oromo has
reliable and prove to be a most useful its own strength and weakness sides. As the
linguistic knowledge for WSD. result showed that, the experiment attempts
As the conducted experiment showed, each to disambiguate any ambiguous words, if
cluster has context group, where the sense its running in corpus rather than limiting
of these context groups is hopefully itself to treating a restricted ambiguous
different. The underlying assumption is that word. This is one of the strongest sides of
the senses found in similar contexts are this WSD. It is argued that this approach is
similar senses. Then, new occurrences of more likely to assist the creation of
the context can be classified into the closest practical systems.
induced clusters (senses). All contexts of This system has the first work, that
related senses are included in the clustering integrated different algorithm to find the
and thus performed over all the contexts in appropriate sense of ambiguous words in
the sentences. The underlying hypothesis is Afan Oromo. However, there is some
that ambiguous word contexts clustering ambiguous word on which the performance
captures the reflected unity among the of our approach is actually low. The system
contexts and each cluster reveal possible reported that the vector space model was
relationships existing among these contexts affected by the data sparsity. The frequency
as seen in table 2 (Fei Shao and Yanjiao of co-occurrence of most context words is
Cao, 2005; Tesfaye, 2011). zero due to the limited corpus size of the
From Table 2, the hybrid approach WSD is language. This result affects our cosine
an encouraging result than unsupervised similarity values (as shown in figure 3
machine learning approach. This is due to above). The other weakness of the system
unsupervised approach is not as hybrid is context clustering in hierarchical
approach, especially hierarchical clustering clustering which was a noisy and yielded
result was noisy. As already discussed low performance as compared to K-Means
before, the obtained result in both and EM clustering (as shown in Table 2).
approaches was different. Therefore, the
linguistic knowledge (hybrid approach) the
best approach to solve WSD than machine CONCLUSION
learning algorithms in Afan Oromo (Ravi WSD has been based on the idea that the
Mante et al., 2014; David, 2014) as shown semantics of the context words belonging
in the experiment (Table 2). However, the in the same sense of a word will have
overall system performance gained thus far similar neighboring words. The context is
is not surprising since this language was hence a source of information and is the
under resourced materials and tools. only means to identify the sense of an
From the finding of this experiment the ambiguous word. The approach does not
addition of deep linguistic knowledge to a rely on labeled training text and does not
WSD system is a significant rise in make use of any expensive resources like
disambiguation accuracy and coverage as dictionaries, thesauri, and WordNet (Adam,
compared with results discussed so far. It is 2007).
especially interesting that using the For under resourced Ethiopian language
preceding modifiers of the ambiguous word like Afan Oromo the hybrid approach is
perform better result. We can conclude that recommended. Since there is no annotated
modifiers contain a lot of valuable clues for corpus, hybrid approach plays a great role
disambiguation (Gamta, 2005). to disambiguate. The hybrid approach
relies on hand-constructed rules that are
acquired from language specialists rather David Yarowsky and Radu Florian (2002)
than automatically trained from data. In Evaluating sense disambiguation
this study, we faced a significant challenge across diverse parameter spaces.
as Afan Oromo lacks Word Net, Sense Natural Language Engineering.
Definition and annotated resources. Taking
into account their contribution to WSD and Debele G. (2014) Afan Oromo News Text
other research concerned institutions Summarizer, Masters thesis,
should develop these resources. Pohang University of Science
and Technology, Pohang: Korea.
LIST OF ABBREVIATIONS Fei Shao, Yanjiao Cao(2005) A New Real-

IR: Information Retrieval time Clustering Algorithm,
NLP: Natural Language Processing Department of Computer Science
WSD: Word Sense Disambiguation and Technology, Chongqing
University of Technology
Chongqing China. Linguistics:
ACKNOWLEDGMENT Linguistic Studies in Honour of
I would like to acknowledge Jimma Jan Svartvik, London: Longman.
University for financial support in this Flickinger, D. (2015) Natural Language
work. Secondly, I would like to thank Engineering-Efficient Processing
Oromia Radio and Television (ORTO) with HPSG: Methods, Systems,
agency for allowance of the corpus/data Evaluation Coli website: from:
collection from their studio. http://www.coli.uni-sb.de/nlesi/ on
3/15/2015.
Francisco Oliveira, Fai Wong, Yiping Li,

REFERENCES Jie Zheng (2005) Unsupervised
Abera, N. (2001) Long vowels in Afan Word Sense Disambiguation and
Oromo: A generic approach, Rules Extraction using non-
School of graduate studies, Addis aligned bilingual corpus, Tsinghua
Ababa University, Ethiopia. University: Beijing.
Adam, Kilgarriff (2007) I dont believe in Gamta T. (2005) Seera Afaan Oromoo,
word senses: Computers and Finfinnee, Boolee Press.
Humanities, London:
Longman. Gragg and Gene B.(2006) Oromo of
Wollega: non-semetic languages
Agirre, E. and Martinez, D.(2000) of Ethiopia , East lansing,
Exploring automatic word sense Michigan state University press.
disambiguation with decision lists
and the web, Toronto, Ontario. Guya T. (2003) CaasLuga Afaan Oromoo:
Jildii-1, Gumii Qormaata Afaan
David Yarowsky (2014) Decision lists for Oromootiin Komishinii Aadaa fi
lexical ambiguity resolution: Turizimii Oromiyaa, Finfinnee.
application to accent restoration
in Spanish and French, USA: Ide, Nancy and Jean Vronis (2010) Word
Stroudsburg.
sense disambiguation: The state Salton G. (2015) The Measurement of

of the art, Computational Term Importance in Automatic
Linguistics. Indexing: In Journal of the
American Society for Information
Jurafsky, D., Martin, J. (2009) Speech and Science, vol.32.
Language Processing (2nd
Edition) Pearson Education. Shaikh Samiulla Zakirhussain (2013)
Unsupervised Word Sense
J. Sreedhar, S. Viswanadha Raju, A. Disambiguation, Indian Institute
Vinaya Babu, Amjan Shaik, P. of Technology: Bombay.
Pavan Kumar (2012) Word
Sense Disambiguation: An Solomon, A.(2011) Unsupervised Machine
Empirical Survey. Hyderabad: Learning Approach For Word
India. Sense Disambiguation To
Amharic Words, Addis Ababa
Kaplan,A. (2015) An experimental study University. Addis Ababa:
of ambiguity and context, Ethiopia.
Mechanical Translation.
Stefan Thater, Hagen Frstenau, and
Kebede T. (2013) Word Sense Manfred Pinkal (2011)
Disambiguation for Afan Oromo Contextualizing semantic
Language, Addis Ababa representations using syntactically
University, Addis Ababa, enriched vector models: In
Ethiopia. Proceedings of the 48th Annual
Meeting of the Association for
Nancy and Jean, (2007) Word Sense Computational Linguistics.
Disambiguation Algorithms and Uppsala: Sweden.
Applications, Ney work, Spring.
Tesfaye D. (2010) Designing a Stemmer
Ranjan Pal A., Anirban Kundu, Abhay for Afan Oromo Text: A hybrid
Singh, Raj Shekhar, Kunal Sinha approach, school of graduate
(2013) A Hybrid Approach to studies, Addis Ababa University,
Word Sense Disambiguation Ethiopia.
Combining Supervised And
Unsupervised Learning. India: Tesfaye D. (2011) A Rule-based Afaan
West Bengal. Oromo Grammar Checker, Addis
Ababa University, Addis Ababa ,
Ravi Mante, Mahesh Kshirsagar and Ethiopia.
Prashant Chatur (2014) A Review
of Literature on Word Sense Wassie G. (2014) A Word Sense
Disambiguation, Government Disambiguation Model for
college of engineering Amravati: Amharic Words using
Maharashtra. Semi-Supervised Learning
Paradigm, School of graduate
Roberto Navigli (2009) Word Sense studies. Addis Ababa
Disambiguation: A Survey, ACM University: Ethiopia.
Computing Surveys, USA:
Washington Dc. Yarowsky, D. (2007) Unsupervised word
sense disambiguation rivaling

supervised methods, In
Proceedings of the 33rd Annual
Meeting of the Association for
Computational Linguistics.
Cambridge: M.A.

Towards The Sense Disambiguation of Afan Oromo Words Using Hybrid Approach (Unsupervised Machine Learning and Rule Based)

Загружено:

Сведения о документе

Оригинальное название

Авторское право

Доступные форматы

Поделиться этим документом

Поделиться или встроить документ

Параметры публикации

Этот документ был вам полезен?

Это неприемлемый материал?

Авторское право:

Доступные форматы

Towards The Sense Disambiguation of Afan Oromo Words Using Hybrid Approach (Unsupervised Machine Learning and Rule Based)

Загружено:

Авторское право:

Доступные форматы

Towards the Sense Disambiguation Workineh T., Debela T. & Teferi K.

Towards the Sense Disambiguation of Afan Oromo Words

Workineh Tesema1, Debela Tesfaye1 and Teferi Kibebew1

INTRODUCTION we have used the context information

Figure 1. Contexts Identification

limitation of algorithms by rule based an improvement in assigning correctly the

based is an approach used in the combination with the manually crafted

Dataset SENSE DISAMBIGUATION

For example, [Mucayyoon ija akka boqqoolloo qabdi].

Disambiguation is done by analyzing the If ambiguous word preceded by

Figure 2. Architecture of the System

IMPLEMENTATIONS corpus is prepared for the purpose of this

Figure 3. User Interface of the WSD

words. In order to measure the performance

Table 1. Determining Optimal Window Size

Window Size (N) Accuracy (%)

Table 2. Unsupervised and Hybrid WSD Results

Clustering Accuracy (%)

Single Link 54.4% 61%

Complete Link 54.4% 59.7%

Average Link 54.4% 61%

K-Means 56.9% 71.2%

Figure 4. Dendrogram of the Senses Clustered for Bahe

Evaluation of Sense Disambiguation was undertaken on the basis of precision

# Correctly Disambiguated Words

# Correctly Disambiguated Words

are dealing with Vector Space Model

LIST OF ABBREVIATIONS Fei Shao, Yanjiao Cao(2005) A New Real-

Francisco Oliveira, Fai Wong, Yiping Li,

sense disambiguation: The state Salton G. (2015) The Measurement of

sense disambiguation rivaling

Вам также может понравиться