
A Fast Document Classification Algorithm for Gene Symbol Disambiguation in the BITOLA Literature-Based Discovery Support System

Andrej Kastrin1 and Dimitar Hristovski, PhD2

1 Institute of Medical Genetics, University Medical Centre, Ljubljana, Slovenia
2 Institute of Biomedical Informatics, Faculty of Medicine Ljubljana, Slovenia

Abstract

Gene symbol disambiguation is an important problem for biomedical text mining systems. When detecting gene symbols in MEDLINE® citations, one of the biggest challenges is that many gene symbols also denote other, more general biomedical concepts (e.g. CT, MR). Our approach to this problem is first to classify the citations into genetic and non-genetic domains and then to detect gene symbols only in the genetic domain. We used ontological information provided by Medical Subject Headings (MeSH®) for this classification task. The proposed algorithm is fast and is able to process the full MEDLINE distribution in a few hours. It achieves predictive accuracy of 0.91. The algorithm is currently implemented in the BITOLA literature-based discovery support system (http://www.mf.uni-lj.si/bitola/).

Introduction

The exponential growth of the life sciences literature makes it difficult even for experts to absorb all the relevant knowledge in their specific field of interest. Sophisticated technologies are needed for effective data acquisition. In contrast to information retrieval and information extraction, which find information explicitly stated in the literature, the aim of literature-based discovery is to infer novel and potentially meaningful knowledge that is implicitly stated within bibliographic databases such as MEDLINE. The idea that the biomedical literature contains unnoticed but discoverable connections was first introduced by Swanson,1 suggesting the therapeutic effect of fish oil on Raynaud’s disease. Several literature-based discovery systems inspired by Swanson’s original idea have since been developed.

In contrast to other approaches, Hristovski et al.2 applied association rule mining to describe relationships between biomedical concepts and developed a system called BITOLA. Although BITOLA can be used as a general biomedical discovery support system, it is especially useful for describing new relations between diseases and genes. In BITOLA we extract the gene symbols from the title and abstract fields of MEDLINE citations. To manage ambiguous gene symbols in the universe of MEDLINE citations,3 an algorithm to classify citations as being in the genetic domain and thereby to filter out citations not in the genetic domain was needed. We refer to the genetic domain as a subset of MEDLINE citations in which occurrences of gene symbols are more probable than in any other subset of citations. Our problem can be formally described as a task of assigning a MEDLINE citation to the genetic or non-genetic domain, based on its content.

In the literature many methods, such as Support Vector Machines, k-Nearest Neighbor, and Naïve Bayes, have been applied and have shown different performance. Practically all existing methods use the results of preliminary training. In this paper we consider a novel and fast unsupervised corpus-based classification method that needs more limited information for decision making. The proposed approach is fast, simple to implement and can be easily integrated into existing systems. It is able to process the full MEDLINE distribution in a few hours. It also allows the trade-off between precision and recall to be tuned by changing a single threshold parameter.

Background and Related Work

Word Sense Disambiguation (WSD) is the process of determining which of the senses of an ambiguous word should be invoked in a particular use of the word.4 It is crucial in the life sciences domain for improving performance of text mining systems and information retrieval in particular. A special field of WSD is Gene Symbol Disambiguation (GSD).5 For example, in the sentence6 ‘The inverse association between MR and VEGFR-2 expression in carcinoma suggest a potential tumor-suppressive function for MR’ we need to decide if ‘MR’ stands for ‘mineralocorticoid receptor’ gene or ‘magnetic resonance’ imaging.

Gene symbol ambiguity is difficult to manage with conventional string matching techniques. Moreover, previous research has shown that the ambiguity problem is complicated because a gene term (i) may refer to different species, (ii) may denote a gene or

AMIA 2008 Symposium Proceedings Page - 358

another type of biomedical concept, (iii) may refer to the gene or its products, or (iv) may refer to different genes.

Although numerous methods, techniques, and algorithms have been developed to address GSD, the problem is still challenging. Weeber et al.7 developed a test collection for the purpose of supervised learning in the biomedical language domain. Savova et al.8 addressed the ambiguity with unsupervised clustering of all MEDLINE abstracts. Schijvenaars et al.9 reported on a fast and scalable thesaurus-based disambiguation algorithm that requires very little training data. Humphrey3,10 proposed Journal Descriptor Indexing, an unsupervised method based on statistical associations between words in a training set of MEDLINE citations and a small set of journal descriptors (broad MeSH descriptors assigned to journals per se) to resolve ambiguity. Liu et al.11 presented a two-phase unsupervised method to build a classifier for an ambiguous term, where the first phase compiles a sense-tagged corpus for a particular term, and the second phase builds a classifier on the sense-tagged corpus. Xu et al.12 generated gene profiles from different knowledge sources and then applied the profiles to disambiguate gene symbols. Recently, Farkas13 reported the utility of co-authorship network analysis for GSD.

The remainder of the paper is organized as follows. First we introduce the general concept of our classification approach. Next we present experimental results, and finally concluding remarks and ideas for future work are provided.

Methods

Knowledge sources

Each MEDLINE citation is manually assigned around 12 MeSH descriptors by trained indexers. The 2008 MeSH, which was used in this study, contains 24,767 descriptors.

Our statistical procedure requires a background and a genetic domain dataset. To obtain these, we processed the full MEDLINE Baseline Repository, up to the end of 2007, which contains 16,880,015 citations. As the distribution is in XML format, we extracted the relevant elements and transformed them into a relational text format (i.e. one line for each MeSH descriptor occurrence in each citation). A frequency count file was compiled to provide a background frequency distribution of MeSH descriptors in the whole MEDLINE corpus.

A subset of articles, which we called a genetic domain corpus, was extracted from the background corpus to represent genetically relevant citations. To accomplish this, the ‘gene2pubmed’ file from the Entrez Gene repository14 was downloaded and used as a reference list for identifying MEDLINE citations in which gene symbols occur. A second frequency count file was then created to provide a frequency distribution of MeSH descriptors in the genetic domain corpus.

Scoring algorithm

For each MeSH descriptor in the two frequency lists we applied the chi-square test to obtain a value which shows whether there is a statistically significant difference between the observed frequencies in the background and genetic domain corpus. The chi-square value is calculated as follows:

$$X^2 = \frac{(O_B - E_B)^2}{E_B} + \frac{(O_G - E_G)^2}{E_G}.$$

$O_B$ and $O_G$ are the observed frequencies in the background and genetic domain corpus. $E_B$ and $E_G$ are expected frequencies and are calculated as

$$E_B = \frac{N_B (O_B + O_G)}{N_B + N_G} \quad \text{and} \quad E_G = \frac{N_G (O_B + O_G)}{N_B + N_G},$$

where $N_B$ and $N_G$ are the total frequencies of MeSH descriptors in the background and genetic domain corpus, respectively.

The MeSH descriptor was considered to be a positive indicator if its relative observed frequency in the genetic domain corpus was greater than its relative observed frequency in the background corpus.

The classification algorithm takes two inputs:

1. A frequency profile table of all the MeSH descriptors with significant chi-square scores ($X^2$ > 3.84), noting which descriptors are positive indicators and which are negative.

2. A set of citations to be classified.

Descriptors that appear very frequently (e.g. Humans, Animals, Mice) and are thus not meaningful to the algorithm were removed. We built the stop word list based on MEDLINE check tags. Once the differences between corpora have been evaluated, the results are put in a frequency profile table having two columns, one for the background and another for the genetic domain corpus. This process is illustrated in Figure 1. An example of the frequency profile table is presented in Table 1.
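
As a rough illustration of the frequency-profile step described in this section, the following Python sketch computes the expected frequencies, the chi-square statistic, and the positive/negative indicator for each descriptor. The names `ProfileRow`, `profile_row`, and `build_profile` are hypothetical. The choice to compute the corpus totals before stop words are dropped is also an assumption, since the paper does not specify this detail.

```python
from dataclasses import dataclass

CHI2_CRITICAL = 3.84  # significance cut-off quoted in the text


@dataclass
class ProfileRow:
    descriptor: str
    o_b: int        # observed frequency in the background corpus
    o_g: int        # observed frequency in the genetic domain corpus
    e_b: float      # expected frequency in the background corpus
    e_g: float      # expected frequency in the genetic domain corpus
    chi2: float     # chi-square statistic
    positive: bool  # True = positive indicator, False = negative indicator


def profile_row(descriptor: str, o_b: int, o_g: int, n_b: int, n_g: int) -> ProfileRow:
    """Compute one row of the frequency profile table."""
    total = o_b + o_g
    e_b = n_b * total / (n_b + n_g)
    e_g = n_g * total / (n_b + n_g)
    chi2 = (o_b - e_b) ** 2 / e_b + (o_g - e_g) ** 2 / e_g
    # Positive indicator: relatively more frequent in the genetic domain corpus.
    positive = o_g / n_g > o_b / n_b
    return ProfileRow(descriptor, o_b, o_g, e_b, e_g, chi2, positive)


def build_profile(background: dict, genetic: dict, stop_words=frozenset()) -> dict:
    """Keep only descriptors that are significant and not stop words.

    Assumption: N_B and N_G are summed over all descriptors, stop words included.
    """
    n_b = sum(background.values())
    n_g = sum(genetic.values())
    profile = {}
    # The genetic corpus is a subset of the background corpus, so iterating
    # over the background counts visits every descriptor.
    for descriptor, o_b in background.items():
        if descriptor in stop_words:
            continue
        row = profile_row(descriptor, o_b, genetic.get(descriptor, 0), n_b, n_g)
        if row.chi2 > CHI2_CRITICAL:
            profile[descriptor] = row
    return profile
```

With toy counts such as `build_profile({"A": 90, "B": 810}, {"A": 30, "B": 70})`, descriptor "A" comes out as a positive indicator and "B" as a negative one. The counts are invented for illustration and are not MEDLINE data.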


Table 1. Example entries from the frequency profile table.

MeSH Descriptor                            OB      OG       EB      EG      X²  Ind
Amyotrophic Lateral Sclerosis            6133     224     6196     161      25   +
Animals                               3615845  133463  3654231   95077   15901   +
Computer Simulation                     36544     326    35935     935     407   −
Elasticity                              16272      48    15906     414     332   −
Genes, Recessive                        10948    1269    11907     310    3047   +
Guanine Nucleotide Exchange Factors      1568     685     2196      57    7080   +
Humans                                8544765  108109  8433449  219425   57941   −
Mice                                   691495   77905   749889   19511  179314   +
Models, Chemical                        29588     608    29430     766      33   −
Models, Molecular                       68781    5425    72324    1882    6845   +
Models, Statistical                     19630      60    19191     499     397   −
Polymers                                26724     170    26212     682     394   −
Stress, Mechanical                      25706     320    25366     660     180   −
Surface Tension                          2976      12     2912      76      55   −

Note: OB, OG – observed frequency in the background and genetic domain corpus; EB, EG – expected frequency in the background and genetic domain corpus; X² – chi-square statistic; Ind – indicator (+ positive, − negative).
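
The indicator column is what the classifier consumes. A minimal Python sketch of the decision score might look like the following: every positive-indicator descriptor adds 1, every negative-indicator descriptor subtracts 1, and stop-word descriptors are skipped. The indicator values are copied from Table 1, with a negative indicator inferred wherever OG is below EG. The function names, the two descriptor lists, and the default threshold of 2 (applied as greater than or equal) are assumptions made for illustration, not taken from the BITOLA implementation.

```python
POSITIVE, NEGATIVE = +1, -1

# Indicators from Table 1 (stop-word descriptors are handled separately).
INDICATORS = {
    "Amyotrophic Lateral Sclerosis": POSITIVE,
    "Computer Simulation": NEGATIVE,
    "Elasticity": NEGATIVE,
    "Genes, Recessive": POSITIVE,
    "Guanine Nucleotide Exchange Factors": POSITIVE,
    "Models, Chemical": NEGATIVE,
    "Models, Molecular": POSITIVE,
    "Models, Statistical": NEGATIVE,
    "Mutation": POSITIVE,
    "Polymers": NEGATIVE,
    "Stress, Mechanical": NEGATIVE,
    "Surface Tension": NEGATIVE,
}

# Very frequent descriptors named in the text as examples of stop words.
STOP_WORDS = {"Humans", "Animals", "Mice"}


def decision_score(descriptors, indicators=INDICATORS, stop_words=STOP_WORDS):
    """Sum +1 / -1 over the descriptors of one citation."""
    score = 0
    for descriptor in descriptors:
        if descriptor in stop_words:
            continue
        # Descriptors absent from the profile table contribute nothing.
        score += indicators.get(descriptor, 0)
    return score


def is_genetic(descriptors, threshold=2):
    """Classify a citation as genetic when its score reaches the threshold."""
    return decision_score(descriptors) >= threshold


# Hypothetical descriptor lists built from the Table 1 rows (8 and 7 entries).
citation_a = [
    "Computer Simulation", "Elasticity", "Models, Chemical", "Models, Molecular",
    "Models, Statistical", "Polymers", "Stress, Mechanical", "Surface Tension",
]
citation_b = [
    "Amyotrophic Lateral Sclerosis", "Animals", "Genes, Recessive",
    "Guanine Nucleotide Exchange Factors", "Humans", "Mice", "Mutation",
]

print(decision_score(citation_a), is_genetic(citation_a))  # -6 False
print(decision_score(citation_b), is_genetic(citation_b))  # 4 True
```

Keeping the stop-word check separate from the indicator lookup means a check tag such as Mice never influences the score, even though its chi-square value in Table 1 is very large.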
[Figure 1: flow diagram with the boxes Medline Baseline, Entrez Gene, Genetic Domain, MeSH Descriptor Extraction & Frequency Count (twice), Chi-Square Statistic, and Frequency Profile Table]

Figure 1. Flow diagram for the preparation of frequency profile table from knowledge sources.

The algorithm proceeds by reading each MEDLINE citation c in turn and assigning a decision score to it as follows:

    Score(c) = 0
    For each MeSH descriptor d
        If d is a positive indicator
            Score(c) = Score(c) + 1
        Else if d is a negative indicator
            Score(c) = Score(c) − 1

The output of this process is a list of scores for all the citations, with the highest total given to those citations containing MeSH descriptors typical for the genetic domain.

Performance evaluation

Calibration and validation sets were needed in order to define a threshold between genetic and non-genetic domain citations and to evaluate the performance of our algorithm. Two independent sets of 100 citations each were randomly selected from MEDLINE and manually annotated by two annotators with biological domain knowledge. Their task was to identify citations as being in either the genetic or non-genetic domain. They achieved kappa scores for inter-annotator agreement of 0.81 and 0.91 for the calibration and validation set, respectively. Consensus voting was then used to achieve complete agreement between judges. Both sets are available on [...]

To draw a boundary between genetic and non-genetic domain citations, we plotted predictive accuracy against score values, and the threshold parameter was set to maximize accuracy. Predictive accuracy is the overall correctness of the prediction and was calculated as the sum of correct classifications (TP and TN) divided by the total number of classifications (TP + FP + FN + TN). In this formula, TP is the number of true positives (citations are about genetics and are classified into the genetic domain); FP denotes the number of false positives (citations are classified into the genetic domain in the absence of genetic contents); FN refers to the number of false negatives (citations are wrongly classified into the non-genetic domain), and TN is the number of true negatives (citations are correctly classified into the non-genetic domain). The threshold can be adjusted
to modify the trade-off between false positives and false negatives.

Besides predictive accuracy, the performance measures of precision, recall, and their combination, the F-measure, were used to evaluate the results of the classification algorithm. Precision (Pre) and recall (Rec) were calculated as:

$$\mathrm{Pre} = \frac{TP}{TP + FP} \quad \text{and} \quad \mathrm{Rec} = \frac{TP}{TP + FN}.$$

The F-measure combines precision and recall into a single measure:

$$F = \frac{2 \times \mathrm{Rec} \times \mathrm{Pre}}{\mathrm{Rec} + \mathrm{Pre}}.$$

The range of all measures falls between 0 and 1, with 1 indicating the best quality of the classifier.

Results

In Figure 2 the predictive accuracy as a function of cut-off value is depicted in order to visualize the performance at different points along the decision score distribution. The threshold parameter (≥ 2) was set to optimize predictive accuracy on a calibration set and was not modified during testing. That is, all citations for which the decision score was greater than or equal to the specified threshold value were classified as genetic domain citations.

[Figure 2: predictive accuracy plotted against the decision score cut-off; horizontal axis ticks from −20 to 15]

Figure 2. Calibration plot.

The following example illustrates the use of our classification algorithm. We retrieved two MEDLINE citations titled, “Strain-dependent localization, microscopic deformations, and macroscopic normal tensions in model polymer networks” (PMID: 15697942) and “Recessive motor neuron diseases: mutations in the ALS2 gene and molecular pathogenesis for the upper motor neurodegeneration” (PMID: 15651293) with 8 and 7 corresponding MeSH descriptors, respectively. Three of the descriptors in the latter example were also found in our stop word list and were excluded from further processing. For each descriptor in the list we then incremented or decremented the decision score according to the indicator (Table 1). The indexing process resulted in two scores:

Score (PMID: 15697942) = −6
Score (PMID: 15651293) = 4.

According to the previously defined threshold, we classify the first citation into the non-genetic domain and the second into the genetic domain.

The performance of the classifier was evaluated on the validation set of citations. The proposed algorithm achieved a predictive accuracy of 0.91 with 0.64 recall and 0.93 precision. Figure 3 shows the results of recall and precision with a varying threshold parameter.

[Figure 3: precision-recall curve with labeled threshold points; axis ticks from 0.0 to 1.0]

Figure 3. Precision-Recall plot for genetic domain citations classification.

The graph goes through two points. The first (0,1) is where the classifier detects no genetically relevant citations. In this case it always gets the non-genetic citations right but it gets all genetic domain citations wrong. The second point (1,0) is where all citations are classified as genetically relevant. So the classifier gets all genetically relevant citations right, but it gets all non-genetic citations wrong. As expected, a trade-off between precision and recall exists that can be tuned, mainly by modifying the threshold parameter. The optimal accuracy with recall and precision already mentioned above is defined at point A. For example, if we increase the threshold (B = 5), we classify citations with more precision (Pre = 1.00) but with decreased recall (Rec = 0.18). On the other
hand, if we decrease the threshold (C = 1), citations are less precisely classified (Pre = 0.67) but with increased recall (Rec = 0.82).

Conclusions and Further Work

In this paper we have presented a fast and simple MeSH descriptor based classification algorithm that classifies MEDLINE citations into the genetic or the non-genetic domain. This algorithm serves for gene symbol disambiguation in the BITOLA biomedical discovery support system. In the validation study, the overall predictive accuracy reached for the classifier was 0.91. The trade-off between precision and recall can be tuned by changing a single threshold parameter.

Future work will consist of (i) constructing more complex classifiers with respect to MeSH qualifiers (i.e., subheadings) and descriptor/qualifier tuples, and (ii) compiling a larger set of annotated citations in order to validate our experiments with a higher degree of confidence.

Acknowledgements

We are grateful to Susanne M. Humphrey and Thomas C. Rindflesch for helpful suggestions and comments. This work was supported by Slovenian Research Agency Grant J3-7411.

References

1. Swanson DR. Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspect Biol Med. 1986;30(1):7-18.
2. Hristovski D, Stare J, Peterlin B, Dzeroski S. Supporting discovery in medicine by association rule mining in MEDLINE and UMLS. Medinfo. 2001;10(2):1344-1348.
3. Hristovski D, Peterlin B, Mitchell JA, Humphrey SM. Using literature-based discovery to identify disease candidate genes. Int J Med Inform.
4. Manning CD, Schuetze H. Foundations of statistical natural language processing. Cambridge: MIT Press; 2003.
5. Chen L, Liu H, Friedman C. Gene name ambiguity of eukaryotic nomenclatures. Bioinformatics. 2005;21(2):248-256.
6. Di Fabio F, Alvarado C, Majdan A, et al. Underexpression of mineralocorticoid receptor in colorectal carcinomas and association with VEGFR-2 overexpression. J Gastrointest Surg. 2007;11(11):1521-1528.
7. Weeber M, Schijvenaars BJ, Van Mulligen EM, et al. Ambiguity of human gene symbols in LocusLink and MEDLINE: creating an inventory and a disambiguation test collection. Proc AMIA Symp. 2003;704-708.
8. Savova G, Pedersen T, Purandare A, Kulkarni A. Resolving ambiguities in biomedical text with unsupervised clustering approaches. Research Report UMSI 2005/80 and CB Number 2005/21; Minneapolis, Minnesota: University of Minnesota Supercomputing Institute; 2005.
9. Schijvenaars BJ, Mons B, Weeber M, et al. Thesaurus-based disambiguation of gene symbols. BMC Bioinformatics. 2005;6:149.
10. Humphrey SM, Rogers WJ, Kilicoglu H, Demner-Fushman D, Rindflesch TC. Word sense disambiguation by selecting the best semantic type based on Journal Descriptor Indexing: preliminary experiment. J Am Soc Inform Sci Tech. 2006;57(1):96–113.
11. Liu H, Lussier YA, Friedman C. Disambiguating ambiguous biomedical terms in biomedical narrative text: an unsupervised method. J Biomed Inform. 2001;34(4):249-261.
12. Xu H, Fan JW, Hripcsak G, Mendonca EQ, Markatou M, Friedman C. Gene symbol disambiguation using knowledge-based profiles. Bioinformatics. 2007;23(8):1015-1022.
13. Farkas R. The strength of co-authorship in gene name disambiguation. BMC Bioinformatics. 2008;9:69.
14. Maglott D, Ostell J, Pruitt KD, Tatusova T. Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res. 2007;35(Database issue):D26–31.
