
Fully Non-Parametric Feature Selection

Models for Text Categorization

Yue Zhao
Department of Computer Science
University of Toronto
Toronto, ON M5B1L2
yuezhao@cs.toronto.edu

Abstract
As an important preprocessing technology in text categorization (TC),
feature selection can improve the scalability, efficiency and accuracy of a
text classifier [1]. In this paper, we propose a two-stage TC model: a fully
non-parametric feature selection model. This classification model first
selects features using non-parametric statistical methods, and then applies
non-parametric classifiers on the selected features for categorization. The
improvements are based on two ideas: non-parametric models require no
prior knowledge of the data probability distribution, and feature selection
could improve classification efficiency and accuracy. We find a competitive
fully non-parametric feature selection model: using artificial neural network
(ANN) on the features selected by the χ² statistic (CHI). This model obtains
better classification accuracy and requires fewer features than the
state-of-the-art classifier (support vector machine) on real-world datasets.
Therefore, the ANN-with-CHI classification model is well suited to data
with an unknown underlying probability distribution. We also discover that two
statistical feature selection methods, the Kendall rank correlation coefficient
and the Spearman rank correlation, are strongly correlated in TC feature
selection.

1 Introduction
TC is a pattern classification task for text mining and is necessary for the efficient management
of textual information systems [2]. In text classification, one typically uses a bag-of-words
model: each position in the input feature vector corresponds to a given word or phrase [3].
Therefore, data could be represented as very high-dimensional but sparse vectors [4].
To deal with high-dimensional sparse data, statistical feature selection methods are always
used to select a subset of relevant features for the model. Then a classifier is applied on the
subset of features to proceed categorization. Effective feature selection could make large
problems more computationally efficient [3], and make the learning task more accurate [3].
Usually, the choice of statistical feature selection methods and classifiers is heuristic and
empirical due to the limited knowledge of the data. This could result in the selection of less
important features and poor classification accuracy when the chosen feature selection
method or classifier makes assumptions that differ from the underlying model.
Therefore, we propose to use fully non-parametric models for unknown data. In fully
non-parametric models, we use non-parametric methods to select features and apply
non-parametric classifiers on the selected features for TC. Because fully non-parametric models
make no assumptions about the data, their applicability is much wider and their quality more
reliable than those of corresponding parametric models. While many feature selection techniques
have been tested on popular text classification datasets [3], such as Reuters-21578 and
OHSUMED, thorough evaluations on this combined model are rarely discussed. Our
experiments on real-world datasets may demonstrate the potential of this model. We seek the
answers to the following questions with empirical evidence:
Which fully non-parametric feature selection models perform best? Given the high
dimensionality of the feature space, do competitive models of this kind even exist?
Do fully non-parametric feature selection methods share common characteristics in
text categorization?
The rest of this paper is organized as follows. Section 2 describes the statistical feature
selection methods. A parametric selection method, Pearson correlation coefficient, is
included to conduct the comparison. Section 3 describes the data collection and classifiers.
For comparison purposes, a linear-kernel support vector machine (parametric) is also included.
Section 4 describes the design of experiments and reports the evaluation results. The
conclusion is drawn in Section 5.

2 Statistical feature selection methods
2.1 χ² statistic (CHI)
Chi-Squared is a common non-parametric statistical test that measures divergence from the
distribution expected if one assumes that feature occurrence is actually independent of the
class value. Yang and Pedersen [6] state that information gain and Chi-Squared are the most
effective measures for aggressive term removal without losing categorization accuracy in TC.
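As a concrete illustration (not from the original paper), a minimal sketch of per-feature chi-squared scoring on a binary term-presence matrix, assuming scikit-learn is available; the toy matrix and variable names are ours:

```python
# Minimal sketch: rank features by chi-squared score against the class labels.
# Assumes scikit-learn; the toy data below is illustrative only.
import numpy as np
from sklearn.feature_selection import chi2

X = np.array([[1, 0, 1],        # rows: documents, columns: term presence (0/1)
              [0, 1, 1],
              [1, 0, 0],
              [0, 1, 0]])
y = np.array([0, 1, 0, 1])      # class labels

scores, p_values = chi2(X, y)   # one chi-squared statistic per feature
ranking = np.argsort(scores)[::-1]   # features ordered from most to least informative
print("scores:", scores, "ranking:", ranking)
```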

2.2 Kendall rank correlation coefficient (KRCK)


Kendall rank correlation coefficient is a non-parametric statistic used to measure the
association between two measured quantities: the similarity of the data when ranked by each
of the quantities.

2.3 Spearman rank correlation (SRC)


Spearman rank correlation is a non-parametric statistical measure of the strength of a
monotonic association between two ranked variables.

2.4 Pearson product-moment correlation coefficient (PPMCC)


Pearson product-moment correlation coefficient is a widely used parametric method for
measuring the degree of relationship between linearly related variables.
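For comparison, a small sketch (our illustration, using SciPy) of how a single feature column could be scored against the class labels with each of the three correlation measures above; a full pipeline would repeat this for every feature and rank by the absolute coefficient:

```python
# Score a single feature against the labels with Kendall, Spearman, and Pearson.
# Assumes SciPy; the toy vectors are illustrative only.
import numpy as np
from scipy.stats import kendalltau, spearmanr, pearsonr

feature = np.array([1, 0, 1, 0, 1, 0])   # presence of one term across documents
labels = np.array([1, 0, 1, 1, 1, 0])    # binary relevance to one class

tau, _ = kendalltau(feature, labels)     # KRCK: rank-based, non-parametric
rho, _ = spearmanr(feature, labels)      # SRC: rank-based, non-parametric
r, _ = pearsonr(feature, labels)         # PPMCC: assumes a linear relationship
print(tau, rho, r)
```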

3 Data and Classifiers


3.1 Data collection
The data we use in this study is Cuisine-Ingredient data [7], a public dataset provided by
Kaggle.com and Yummly. This is a real-world dataset for which no prior knowledge of the underlying distribution exists.
The JSON-format corpus contains 39774 recipes. In each recipe, there is an associated ID,
cuisine style, and a list of ingredients. In total, there are 6714 different ingredients, such as
Japanese fish, and 20 cuisine style classes, such as Japanese. We divide the corpus into a
training set of 33808 entries and a testing set of 5966 entries.
The feature space is built by bag-of-words. We extract all ingredients into an ingredient
dictionary. As each ingredient is unique within an entry, the input vector of a single entry is
based on the presence of an ingredient in this entry. If a recipe contains an ingredient, 1 is
assigned to the ingredient position in the recipe feature vector. Otherwise, 0 is assigned. In
short documents, words are unlikely to repeat, making Boolean word
indicators (presence/absence) nearly as informative as counts [3]. Training and testing targets
are built on the cuisine style associated with each recipe. After the bag-of-words conversion, more
than 99.8% of the feature space is filled with 0. Without feature selection, the dimensionality of
the feature vector of each entry is 6613, and no more than 21 of the features have meaningful values.
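A minimal sketch of this conversion (our illustration; the file name, helper variables, and use of NumPy are assumptions, while the split sizes follow the text):

```python
# Build a binary bag-of-ingredients matrix from the Kaggle recipes JSON.
# The file name "train.json" and all variable names are assumptions.
import json
import numpy as np

with open("train.json") as f:             # each recipe: id, cuisine, ingredients
    recipes = json.load(f)

vocab = sorted({ing for r in recipes for ing in r["ingredients"]})
index = {ing: j for j, ing in enumerate(vocab)}

X = np.zeros((len(recipes), len(vocab)), dtype=np.int8)
y = np.array([r["cuisine"] for r in recipes])
for i, r in enumerate(recipes):
    for ing in r["ingredients"]:
        X[i, index[ing]] = 1              # presence/absence, not counts

X_train, X_test = X[:33808], X[33808:]    # split sizes as stated in the text
y_train, y_test = y[:33808], y[33808:]
```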

3.2 Feature engineering

Removing rare words that occur two or fewer times may be useful [3]. Forman
shows that words in a corpus usually follow a Zipf-like distribution, where a few
words occur frequently while many words occur rarely [3].
Stemming and lemmatizing, which merge various word forms such as plurals and verb
conjugations into one distinct term, also reduce the number of features to be
considered [3]. We use the Porter Stemming Algorithm [8] in this study.
Cosine similarity analysis is widely used to measure the similarity of two
vectors. We use this analysis to remove similar input vectors and reduce the size of
the data before feature selection.
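The following sketch (our illustration, not the paper's code) shows one way the three steps could be implemented; NLTK's PorterStemmer and the 0.99 similarity threshold are assumptions:

```python
# Pre-selection steps: rare-term removal, Porter stemming, cosine-similarity pruning.
# NLTK's PorterStemmer and the 0.99 cut-off are assumptions, not the paper's values.
import numpy as np
from nltk.stem import PorterStemmer
from sklearn.metrics.pairwise import cosine_similarity

def drop_rare_columns(X, min_count=3):
    """Remove terms that occur two or fewer times in the corpus."""
    return X[:, X.sum(axis=0) >= min_count]

stemmer = PorterStemmer()

def normalize(term):
    """Merge word forms (plurals, conjugations) into one stemmed term."""
    return " ".join(stemmer.stem(w) for w in term.lower().split())

def drop_near_duplicates(X, y, threshold=0.99):
    """Remove input vectors that are nearly identical to an earlier vector."""
    sims = cosine_similarity(X)
    keep = [i for i in range(X.shape[0])
            if not any(sims[i, j] > threshold for j in range(i))]
    return X[keep], y[keep]
```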

3.3 Classifiers
3.3.1 K nearest neighbors (KNN)
KNN is a simple but effective non-parametric classification algorithm that stores all
available cases and classifies new cases based on a similarity measure. KNN stands out
for its simplicity and is a widely used technique for TC [10].

3.3.2 Artificial neural networks (ANN)


Artificial neural networks are non-parametric models composed of a large number of highly
interconnected processing elements working in unison to solve specific problems. They have
several well-known advantages: adaptive learning ability, a self-organizing structure,
real-time operation, and high fault tolerance.

3.3.3 Multiclass decision forest (MDF)


Random forest is a non-parametric classifier consisting of an ensemble of decision trees.
MDF uses averaging to find a natural balance between high variance and high bias and thereby
achieve good results.

3.3.4 Multiclass decision jungle (MDJ)


Multiclass decision jungle is a non-parametric classifier proposed by Shotton et al. [9], based on
the ensemble idea. It is a compact and powerful discriminative model that uses rooted decision
directed acyclic graphs to allow multiple paths from the root to each leaf [9].

3.3.5 Support vector machine (SVM)


A support vector machine is a supervised parametric classification algorithm. It aims to find
the hyperplane that separates data points with the maximal margin between decision
boundaries. It has been extensively and successfully used in text classification tasks [2], and
empirical results have shown that SVM classifiers perform well [11]. Pawar [2] states
that SVM consistently achieves the highest classification precision.
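A sketch of how the classifiers could be assembled with scikit-learn stand-ins (our assumption; the decision jungle has no scikit-learn implementation and is omitted here, and all hyper-parameters are illustrative):

```python
# Fit scikit-learn stand-ins for KNN, ANN, MDF, and a linear-kernel SVM and
# report their test accuracy. MDJ is omitted (no scikit-learn implementation).
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

def evaluate_classifiers(X_train, y_train, X_test, y_test):
    classifiers = {
        "KNN": KNeighborsClassifier(n_neighbors=5),
        "ANN": MLPClassifier(hidden_layer_sizes=(100,), max_iter=300),
        "MDF": RandomForestClassifier(n_estimators=100),
        "SVM": LinearSVC(C=1.0),          # linear kernel, as stated in the paper
    }
    return {name: clf.fit(X_train, y_train).score(X_test, y_test)
            for name, clf in classifiers.items()}
```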

4 Empirical validation
4.1 Performance measures
The effectiveness of feature selection is measured by the overall accuracy of the current setting:

\[ \text{accuracy} = \frac{\sum_{i}(TP_i + TN_i)}{\sum_{i}(TP_i + TN_i + FP_i + FN_i)} \]

where TP_i denotes true positives, TN_i true negatives, FP_i false positives,
FN_i false negatives, and i indexes the different categories.
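A minimal sketch (our illustration, not the paper's code) of this measure, accumulating the one-vs-rest confusion counts over all categories:

```python
# Compute overall accuracy from per-category one-vs-rest confusion counts.
# Illustrative helper; variable names are assumptions.
import numpy as np

def overall_accuracy(y_true, y_pred, classes):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = tn = fp = fn = 0
    for c in classes:                     # accumulate counts per category
        tp += np.sum((y_pred == c) & (y_true == c))
        fp += np.sum((y_pred == c) & (y_true != c))
        fn += np.sum((y_pred != c) & (y_true == c))
        tn += np.sum((y_pred != c) & (y_true != c))
    return (tp + tn) / (tp + tn + fp + fn)
```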

4.2 Experiment settings


Before applying feature selection techniques, we build the feature space, remove rare words,
perform the stemming and lemmatizing optimization, and conduct cosine similarity analysis.
Then each of the four statistical feature selection methods is used to pick different numbers of
features. The top-ranked 1%, 5%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, and 90% of features
are selected to build the new feature space. The accuracy evaluation is then performed with all
five classifiers on the new feature space.
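A sketch of this evaluation grid (our illustration; the per-feature scoring functions and the classifier objects are assumed to follow the earlier sketches):

```python
# For each selection method and each retained percentage, keep the top-ranked
# features and score every classifier on the reduced space. Illustrative only.
import numpy as np

PERCENTAGES = [1, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90]

def run_grid(score_fns, classifiers, X_train, y_train, X_test, y_test):
    results = {}
    for sel_name, score_fn in score_fns.items():
        order = np.argsort(score_fn(X_train, y_train))[::-1]   # best features first
        for pct in PERCENTAGES:
            cols = order[:max(1, int(len(order) * pct / 100))]  # top pct% of features
            for clf_name, clf in classifiers.items():
                clf.fit(X_train[:, cols], y_train)
                results[(sel_name, pct, clf_name)] = clf.score(X_test[:, cols], y_test)
    return results
```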

4.3 Primary results

Figure 1 displays the performance curve for CHI on Cuisine-Ingredient data (33808 training
recipes and 5966 testing recipes) with KNN, ANN, MDJ, MDF, and SVM respectively.
Figure 2, Figure 3, and Figure 4 display the corresponding classification performance curves
for KRCK, SRC, and PPMCC, respectively. The major performance factor is the overall accuracy
of the classifiers when different numbers of features are selected by the different selection methods.
Figure 5 (MDJ), Figure 6 (ANN), Figure 7 (MDF), Figure 8 (KNN), and Figure 9 (SVM)
show each classifier's performance curve under the CHI, KRCK, SRC, and PPMCC methods.

4.4 Discussion
Overall accuracy: Across our experiments, we find the best overall
accuracy (0.770) is achieved by ANN (non-parametric) when 3967 features are selected by
CHI (non-parametric) ranking. The second-best result (0.769) is achieved by SVM
(parametric) when extracting 5951 features using CHI (non-parametric). Among all
non-parametric classifiers, ANN has the highest overall accuracy and MDJ has the lowest,
across all feature selection methods.
Performance stability: Taking all classifiers into account, we find that ANN, SVM, and
KNN perform stably under the various feature selection methods, while MDJ and MDF behave
unpredictably across all of them.
Selection method performance: CHI outperforms all other feature selection methods for
all classifiers except MDJ. For MDJ, the parametric selection method PPMCC is slightly better
than the other feature selection methods; however, MDJ has poor performance stability.
The parametric method, PPMCC, performs comparably to the two non-parametric selection
methods, KRCK and SRC.

4.5 Correlations
One interesting point is that Figure 5, Figure 6, Figure 7, Figure 8, and Figure 9 suggest a
strong correlation between SRC and KRCK. The difference between these two selection methods
is within 5% on SVM and ANN, and they are exactly the same on MDJ, MDF, and KNN.
Another interesting point comes from KNN: the overall accuracy grows roughly in proportion to
the number of selected features under all feature selection methods. This implies that KNN is
not a good choice as the classifier in fully non-parametric feature selection models.
5 Concluding remarks
In this paper, we examine the TC performances of fully non-parametric feature selection
models, which combine a non-parametric feature selection method with a non-parametric
evaluation classifier. We conduct an extensive study of the performance of 240 variants of
feature selection methods and classifiers. Three non-parametric statistical methods (CHI, KRCK,
and SRC) and one parametric selection method (PPMCC) are used to select features. Four
non-parametric classifiers (KNN, ANN, MDJ, and MDF) and one parametric classifier (SVM)
are included to perform categorization.
We find that CHI (non-parametric) generally outperforms the other selection methods and that
ANN has the best performance among all non-parametric classifiers. ANN (non-parametric) on
CHI (non-parametric) has the best overall accuracy and outperforms SVM (parametric).
Additionally, ANN on CHI needs fewer features than SVM on any feature selection method to
achieve better accuracy. ANN on CHI, a fully non-parametric model, outperforms the
state-of-the-art approach in terms of overall accuracy and the number of features needed. It
could serve as a fundamental feature selection model for data with an unknown probability distribution.
We also find that SRC and KRCK have very strong correlation in feature selection. This
suggests that only one of them is needed in feature selection for TC. One could make a
choice based on the complexity of the implementation and execution.
Acknowledgments
I would like to thank Prof. Ruslan Salakhutdinov for approving the project and Prof. Richard
S. Zemel for encouraging our explorations. I also thank Prof. Omid Shabestari for providing
insights on conducting large experiments on distributed systems.

References
[1] Chen, J., Huang, H., Tian, S., & Qu, Y. (2009). Feature selection for text classification with Naïve
Bayes. Expert Systems with Applications, 36(3), 5432-5435.
[2] Pawar, P. Y., & Gawande, S. H. (2012). A comparative study on different types of approaches to
text categorization. International Journal of Machine Learning and Computing, 2(4), 423-426.
[3] Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification.
The Journal of machine learning research, 3, 1289-1305.
[4] Liu, K., Bellet, A., & Sha, F. (2014). Similarity Learning for High-Dimensional Sparse Data. arXiv
preprint arXiv:1411.2374.
[5] Said, D. A. (2007). Dimensionality reduction techniques for enhancing automatic text
categorization (Master's thesis, Faculty of Engineering, Cairo University, Giza).
[6] Yang, Y., & Pedersen, J. O. (1997, July). A comparative study on feature selection in text
categorization. In ICML (Vol. 97, pp. 412-420).
[7] Kaggle.com Cuisine dataset: https://www.kaggle.com/c/whats-cooking/data
[8] Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3), 130-137.
[9] Shotton, J., Sharp, T., Kohli, P., Nowozin, S., Winn, J., & Criminisi, A. (2013). Decision jungles:
Compact and rich models for classification. In Advances in Neural Information Processing Systems (pp.
234-242).
[10] Rogati, M., & Yang, Y. (2002, November). High-performing feature selection for text
classification. In Proceedings of the eleventh international conference on Information and knowledge
management (pp. 659-661). ACM.
[11] Moh'd A Mesleh, A. (2007). Chi square feature extraction based SVMs Arabic language text
categorization system. Journal of Computer Science, 3(6), 430-435.
