
Data Mining Project - Final Report

Even Cheng
CS277 Data Mining
Mar 18, 2010

PROJECT: FINAL REPORT
Problem Definition:


Multi-label Classification Problem in Wikipedia Dataset.

1. Introduction:


In many aspects of daily life, we run into the need for multi-label classification. From the categories in libraries to the aisles in marts, humans use all their senses to classify objects with extremely high accuracy, even when the objects carry similar labels. But is it another story when a computer does the same thing? Nowadays there are several well-defined and widely used classification methods, such as the K-nearest neighbor algorithm, Gaussian mixture models, neural networks, and Support Vector Machines. They can achieve fair prediction results when the differences among labels are distinct. But when labels are closely related, it becomes much harder for a computer to classify them with good accuracy.

In this project, we simulate the situation described above using a dataset drawn from part of the categories under the political concept in Wikipedia, and we choose the Support Vector Machine as the classification method. For the implementation, we adopt LIBSVM, a tool that provides efficient model training, prediction, and evaluation. Moreover, plenty of miscellaneous tools have been developed on top of LIBSVM, such as a parameter finder written in Python[3] and a multi-label classification tool[2], which gave our research a good starting point.
Here are the questions this final report will answer:
• What can we do if the dataset is unbalanced?
• How does the accuracy vary if we use different numbers of labels in the classification process?
• Is it true that we always get better accuracy by using tf-idf when generating feature vectors?
• How does the accuracy change as we increase the size of our training set?

In this report, we start by reviewing papers related to multi-label classification in Section 2, and briefly cover our dataset in Section 3. In Section 4 we describe how we adopted the tool LIBSVM and how we interpreted its input and output. Section 5 contains the experimental results, with short discussions answering the questions above. The last two sections give suggestions for future work and the conclusion.
2. Related Works:


Single-label classification means we label an object with exactly one label. However, in real life many things span a spectrum of states, and we need multiple labels to reflect them. Multi-label classification was originally motivated by the ambiguity in text categorization[4]. Freund and Schapire proposed AdaBoost[5], and Schapire later continued with the extension BoosTexter[6]; since then, computer scientists have applied the idea in different ways, for example image recognition of scenes and humans. In image recognition, we usually classify a set of feature vectors into a combination of labels rather than returning one specific label[7][8].


There are several popular ways to construct a classifier, for instance: the Naïve Bayes classifier, the K-nearest neighbor algorithm proposed by Cover and Hart in 1967, the decision tree proposed by Murthy in 1998, and the Support Vector Machine proposed by Vapnik in 1995[9].


The Support Vector Machine is a vector-based, supervised machine learning method used for classification and regression. In the scenario of a two-class dataset that is separable by a linear classifier, the aim is to find the separating hyperplane with the maximum margin between the two classes. The data points on the margin are called the support vectors[10].

Figure 1: the margin between two classes of data is maximized[11]

LIBSVM[1] is a library for support vector classification and regression. It implements various formulations and provides a model selection tool that does cross validation via a parallel grid search, which is a more reliable way to select a model. Moreover, it supports multi-class classification using the "one-against-one" approach proposed by S. Knerr, which leads us into the discussion of multi-label classification.
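
As an illustrative sketch (not our actual code), the following shows how the cross-validation-based model selection described above can be driven from LIBSVM's bundled Python bindings; it assumes the bindings are importable as libsvm.svmutil and that 'train.svm' is a placeholder file already in LIBSVM's sparse format:

    # Sketch: grid search over (C, gamma) via LIBSVM's built-in cross validation.
    from libsvm.svmutil import svm_read_problem, svm_train

    y, x = svm_read_problem('train.svm')   # <label> <index>:<value> ... per line

    best = (None, None, -1.0)
    for log2c in range(-5, 16, 2):         # coarse grid, as in LIBSVM's grid.py
        for log2g in range(-15, 4, 2):
            # With '-v 10', svm_train returns the 10-fold cross-validation
            # accuracy instead of a model.
            acc = svm_train(y, x, '-q -c %g -g %g -v 10'
                            % (2.0 ** log2c, 2.0 ** log2g))
            if acc > best[2]:
                best = (2.0 ** log2c, 2.0 ** log2g, acc)

    print('best C=%g, gamma=%g, CV accuracy=%.2f%%' % best)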

3. Data Analysis:


The Wikipedia dataset is extremely unbalanced. For example, the 9th category, "Development", contains 181 articles, which makes it a large category, while other categories have fewer than 5 articles. Besides, since we assign the "+1" label only to articles belonging to the 9th category and "-1" to the rest, and the total number of articles is 11,500, most of the labels in our training set will be "-1". This forms a one-versus-all comparison, and intuitively the predictor can trivially predict all labels as "-1" and still achieve 99% accuracy, which is not what we want.


Setting aside the unbalanced proportion of labels, this is a large dataset to run on a PC: it has 11,500 articles and almost 28,600 vocabulary terms, so we can foresee that cross validation and model training will be exhausting in time.

4. Classification by Support Vector Machine (LIBSVM):

4.1. Before starting the process, we divide our data into training, validation, and testing sets. We leave 1/10 of the data as a test set that is never seen during training. On the remaining 9/10, we run the 10-fold validation described in the following section.
4.2. Pick one category to classify on, and pull the data from the Wikipedia dataset into an intermediate file, one wiki page at a time.
4.3. Read the intermediate file into the program written in MS Visual Studio, which organizes the data into the structure described below; a sketch of producing this format follows the list. Each line represents a wiki page, so each run gives us around 9 thousand lines as our training set.


4.3.1. <label>: There are only two values: [category_id], the given category we are looking for, and 0. Here 0 means the wiki page does not belong to the given category, while [category_id] means the article is one of its members.
4.3.2. <index>: It is represented by the [term_id] used in the dataset; each [term_id] is associated with a certain vocabulary term.
4.3.3. <value>: The value indicates how frequently the term shows up. We want to compare calculating the frequency of a vocabulary term within an article against its frequency in the whole corpus, and see which benefits our prediction more.

Figure 2: The flow of integrating LIBSVM into our project
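
A hedged sketch of how one line of this format can be produced (the function and its arguments are illustrative stand-ins, not the actual MS Visual Studio program):

    # Sketch: emit one wiki page as a line in LIBSVM's sparse format.
    def write_libsvm_line(out, page_categories, term_counts, category_id):
        # <label>: [category_id] if the page belongs to the chosen category, else 0.
        label = category_id if category_id in page_categories else 0
        # <index>:<value> pairs; LIBSVM expects indices in ascending order.
        feats = ' '.join('%d:%g' % (term_id, value)
                         for term_id, value in sorted(term_counts.items()))
        out.write('%s %s\n' % (label, feats))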
4.4. We then treat the output of step 4.3 as input to LIBSVM and run 'svm-train' to train the model. LIBSVM also provides a parameter '-v' that splits the data into n parts and runs n-fold cross validation, reporting the cross-validation accuracy/mean squared error to help select good parameters.
4.5. Tune parameters with patience.
4.6. After we have selected good parameters, run 'svm-train' again without the '-v' parameter to obtain the model for this category (a sketch follows Figure 3).

Figure 3: the command line output after running svm-train
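
A sketch of steps 4.4-4.6 through the same Python bindings (the parameter values here are placeholders standing in for whatever the tuning stage selected):

    # Sketch: retrain without '-v' to obtain and save the final model.
    from libsvm.svmutil import svm_read_problem, svm_train, svm_save_model

    y, x = svm_read_problem('train.svm')
    model = svm_train(y, x, '-c 1 -w1 10 -w-1 0.0157')  # C-SVC is the default type
    svm_save_model('category.model', model)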

4.7. When it comes to predicting unseen data, the test set is fed as a parameter to svm-predict, and we use the tuned model to predict the labels of the test set. After LIBSVM finishes the prediction step, the output file is ready to read in the following format:


Figure 4: the command output (accuracy) and file output of svm-predict
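
The prediction step can be sketched the same way (file names are placeholders):

    # Sketch: predict the held-out test set with the tuned model.
    from libsvm.svmutil import svm_read_problem, svm_load_model, svm_predict

    y_test, x_test = svm_read_problem('test.svm')
    model = svm_load_model('category.model')
    labels, (acc, mse, scc), vals = svm_predict(y_test, x_test, model)
    print('accuracy = %.4f%%' % acc)   # the same number svm-predict reports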

4.8. We collect the results using the macro average, since we want to give equal weight to every individual category regardless of its size; a small sketch follows.
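
For clarity, macro averaging amounts to the following (the dictionary name is illustrative):

    # Sketch: macro average gives every category equal weight, regardless of size.
    def macro_average(per_category_accuracy):
        # per_category_accuracy: {category_id: accuracy in [0, 1]}
        return sum(per_category_accuracy.values()) / len(per_category_accuracy)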

5. Experiment Results:

5.1. Using a weighting parameter to deal with the unbalanced dataset


In the beginning, after some short iterations of cross validation, we discovered that all the iterations had the same accuracy, achieved by assigning every label to "-1", since in our training data the "-1" labels are the majority, outnumbering the "+1" labels by roughly 100 to 1000 times. To solve this problem, the LIBSVM documentation describes a weighting parameter that can assign different weights to the labels, either according to the label distribution or to numbers we prefer. In the training set, we set the weight¹ of the "-1" class as follows:

w(-1) = N(+1) / N(-1)

where N(+1) and N(-1) are the numbers of "+1" and "-1" labels in the training set. The purpose of this assignment is to make the total weights of the "+1" and "-1" classes equal. In Figure 5, we assigned 0.015656 as the weight of the "-1" labels, and as the following discussion shows, this number is pretty close to the one we reached after hundreds of trials.


Below are sample validation results obtained while tuning the weights in the training stage. The remaining parameters are fixed: SVM type C-SVC, cost C = 1, and everything not mentioned at its default value. The expression "-w1 10 -w-1 0.05" in the weight column means we give the "+1" labels weight 10 and the "-1" labels weight 0.05.

¹ The weighting system in LIBSVM is not very sensitive to weights greater than 10, but weights between 0 and 1 take obvious effect.


Weight                     | Accuracy                | Precision             | Recall             | rho
-w1 10 -w-1 0.05           | 98.4369% (10265/10428)  | NaN (0/0)             | 0% (0/163)         | -0.968869
-w1 10 -w-1 0.001          | 1.5631% (163/10428)     | 1.5631% (163/10428)   | 100% (163/163)     | 0.936390
-w1 10 -w-1 0.0155         | 1.7453% (182/10428)     | 1.56595% (163/10409)  | 100% (163/163)     | 0.014046
-w1 50 -w-1 0.0157         | 4.01803% (419/10428)    | 1.57387% (160/10166)  | 98.1595% (160/163) | 0.001324
-w1 10 -w-1 0.01572        | 8.51554% (888/10428)    | 1.59009% (154/9685)   | 94.4785% (154/163) | 0.000052
-w1 10 -w-1 0.015720812475 | 22.9382% (2392/10428)   | 1.49107% (121/8115)   | 74.2331% (121/163) | 0.000000

Figure 5: Different weighting values and their effect on accuracy, precision, and recall


After a few trials we can observe the trade-off between total accuracy and precision/recall. We can also see that to obtain non-zero precision and recall, the rho value always has to be greater than 0; otherwise the model predicts every label as "-1". Besides, when the rho value is very close to 0, we can expect higher accuracy in prediction (but we lose recall as well!).
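
A sketch of how this weighting experiment can be scripted; labels are assumed to be +1/-1, and the ratio below reproduces the ~0.0157 value from Figure 5 (this is an illustration, not our exact trial harness):

    # Sketch: derive the '-1' weight from the label ratio, then cross-validate.
    from libsvm.svmutil import svm_read_problem, svm_train

    y, x = svm_read_problem('train.svm')
    n_pos = sum(1 for label in y if label > 0)
    n_neg = len(y) - n_pos
    w_neg = float(n_pos) / n_neg   # approx. 0.0157 on our data (see Figure 5)
    acc = svm_train(y, x, '-q -c 1 -w1 10 -w-1 %g -v 10' % w_neg)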

5.2. Accuracy drops as the number of labels grows, though not exactly following a power law


In Sections 5.2 and 5.3, to make efficient use of time, we classify articles from [2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 100, 150, 200, 250, 300] categories respectively, 50 times each, then calculate the average accuracy over those 50 runs and plot the figures.


A possible explanation for why the curve does not follow the power law lies in the high variance of the data and the unbalanced category sizes in the dataset: each run randomly draws a certain number of categories to classify, so some runs may draw more small categories or more large categories. Although we average all the run results in the end, which seems a fairer way to handle an unbalanced dataset, how much this actually helps still needs a further look.

Figure 6: Number of categories vs. accuracy; axes are in linear scale.

Figure 7: Number of categories vs. accuracy; axes are in log scale, with an auxiliary line showing the expected curve if the result followed a power law.

5.3. Using tf-idf to construct the feature vector is not always a wise choice


In this section, we want to examine whether tf-idf really works. According to the definition of tf-idf we have:

tf-idf(t, d) = tf(t, d) × log(N / df(t))

where tf(t, d) is the frequency of term t in document d, df(t) is the number of documents containing t, and N is the total number of documents in the corpus.

In Figure 8, we can see from the running results that the feature vectors generated without tf-idf almost always achieve higher accuracy than the ones generated with tf-idf. The red line is our baseline, i.e., what a monkey classifying at random would score.

Figure 8: Number of categories in classification vs. accuracy; axes are in linear scale.


One possibility we see is that the data has already been normalized; a second possible reason is that we used all the available data as feature vectors to train the model, instead of trimming away smaller articles and unusual vocabulary. In that case, the feature vectors already provide so much of the needed information that applying tf-idf may not yield better accuracy.
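
For reference, the tf-idf weighting we compared against raw term frequency can be sketched as follows; `docs` is an assumed list of {term_id: count} dictionaries, not a structure from our actual program:

    # Sketch: reweight raw term counts by tf-idf.
    import math

    def tfidf(docs):
        n_docs = len(docs)
        df = {}                                  # document frequency per term
        for doc in docs:
            for term in doc:
                df[term] = df.get(term, 0) + 1
        return [{term: count * math.log(float(n_docs) / df[term])
                 for term, count in doc.items()}
                for doc in docs]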

5.4. Increasing the training data size mostly guarantees higher accuracy


In this topic, we experiment on the size of the training set: we randomly draw articles from 10 and from 50 categories and divide them into a fraction n for training and 1-n for testing. For articles from both 10 and 50 categories, the accuracy keeps rising most of the time as we put more data into the training set. The values at both ends of Figure 9 and Figure 10 show higher variance: on the left-hand side we have very little training data, so the trained model can never predict categories that never show up in the training data; on the right-hand side we run into the contrary situation, a very large amount of training data but very little data for testing, so the accuracy of an individual category can swing up and down very easily, which is why we get high variance at the right-hand side of Figures 9 and 10.

Figure 9: Training data size, gathered from 10 random categories - Accuracy

Figure 10: Training data size, gathered from 50 random categories - Accuracy
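
The sweep behind Figures 9 and 10 can be sketched as follows, under the assumption that y and x already hold the randomly drawn categories; the fractions are examples, not the exact grid we ran:

    # Sketch: measure test accuracy as the training fraction n grows.
    import random
    from libsvm.svmutil import svm_train, svm_predict

    def accuracy_vs_train_size(y, x, fractions=(0.1, 0.3, 0.5, 0.7, 0.9)):
        idx = list(range(len(y)))
        random.shuffle(idx)
        results = {}
        for n in fractions:
            cut = int(n * len(idx))
            tr, te = idx[:cut], idx[cut:]
            model = svm_train([y[i] for i in tr], [x[i] for i in tr], '-q -c 1')
            _, (acc, _, _), _ = svm_predict([y[i] for i in te],
                                            [x[i] for i in te], model, '-q')
            results[n] = acc
        return results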

6. Future Work


This project reveals only a small part of the idea, and if time allowed us to go further, we could extend the results of Section 5.2 and calculate the accuracy at the first and third quartiles (Q1 and Q3) for the different numbers of labels we have to deal with. For the result of Section 5.3, we suggest comparing it against the same experiment on a dataset with some of the noise trimmed away, and examining accuracy versus tf-idf again. Another aspect worth experimenting on is the tree-like hierarchical relationship among categories: multi-label classification over closely related labels can be a nice topic to try.

7. Conclusion


This report presents several experiments on multi-label classification of the Wikipedia dataset using the Support Vector Machine. We experimented with tuning for unbalanced data by assigning different weights in the training step; we saw accuracy drop as more labels enter the classification, though not necessarily following a power law; we saw that the tf-idf weighting scheme does not always help; and lastly, we saw that we mostly get better accuracy as we increase the training data size.

Afterword


After finishing these experiments, I feel this project has been very meaningful to me. I am happy that I did not get the expected results, because that gave me a good incentive to dig out the reasons behind them, and it made me curious to analyze the results against the possible causes.

8. Reference:

1. Chang, C.-C. and C.-J. Lin., “LIBSVM: a library for support vector machines.” Software available at
http://www.csie.ntu.edu.tw/~cjlin/LIBSVM. (2001)
2. http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/multilabel/
3. http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/#grid_parameter_search_for_regression
4. Zhang, M.L., Zhou, Z.H., “A k-nearest neighbor based algorithm for multi-label classification.”
Proceedings of the 1st IEEE International Conference on Granular Computing, pp. 718–721. IEEE
Computer Society Press (2005)
5. Freund, Y. and Schapire, R.E., "A decision-theoretic generalization of on-line learning and an application to boosting," in Lecture Notes in Computer Science 904, P. M. B. Vitányi, Ed. Berlin: Springer, pp. 23–37. (1995)
6. Schapire, R., Singer Y., “Boostexter: a boosting-based system for text categorization”, Machine
Learning 39 (2/3) 135–168. (2000)
7. Boutell, M.R., Luo, J., Shen, X. & Brown, C.M., “Learning multi-label scene classification”, Pattern
Recognition, vol. 37, no. 9, pp. 1757-71. (2004)
8. Campbell, N.W., Mackeown, W.P.J., Thomas, B.T., Troscianko, T., "The automatic classification of outdoor images", International Conference on Engineering Applications of Neural Networks, Systems Engineering Association, pp. 339–342 (1996)
9. Kotsiantis, S., “Supervised Machine Learning: A Review of Classification Techniques”, Informatica
Journal 31 (2007)
10. C. D. Manning, P. Raghavan, and H. Schutze. Introduction to Information Retrieval. Chapter 15.
Cambridge University Press, 2008.
11. From Wikipedia: Support Vector Machine:
http://en.wikipedia.org/wiki/File:Svm_max_sep_hyperplane_with_margin.png
12. Liu, T.-Y., Yang, Y., Wan, H., Zeng, H.-J., Chen, Z., and Ma, W.-Y. “Support vector machines
classification with a very large-scale taxonomy”. SIGKDD Explorations Newsletter 7(1), 36–43.
(2005)
13. Tsoumakas, G., Katakis, I. “Multi-label classification: An overview”. International Journal of Data
Warehousing and Mining 3, 1–13 (2007)
