
Gender Classification of Blog Authors

A THESIS & PROJECT REPORT


SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE AWARD OF THE DEGREE OF

Bachelor of Technology in Computer Engineering

Under the Supervision of Mr. Tanvir Ahmad Associate Professor

Submitted by
Shams Abubakar Azad(08CSS63) Yasir Zafar Ansari(08CSS74)

DEPARTMENT OF COMPUTER ENGINEERING FACULTY OF ENGINEERING AND TECHNOLOGY JAMIA MILLIA ISLAMIA, NEW DELHI 110025 (2011-2012)

CERTIFICATE
This is to certify that the dissertation/project report entitled "Gender Classification of Blog Authors" has been successfully completed by Shams Abubakar Azad (08CSS63) and Yasir Zafar Ansari (08CSS74), and is an authentic work carried out by them at the Faculty of Engineering and Technology, Jamia Millia Islamia, under my guidance. The matter embodied in this project work has not been submitted earlier for the award of any degree or diploma, to the best of my knowledge and belief.
Date: 19/05/2012

Mr. Tanvir Ahmad Associate Professor Department of Computer Engineering Faculty of Engineering & Technology Jamia Millia Islamia New Delhi 110025

Prof. M.N.Doja Head of Department Department of Computer Engineering Faculty of Engineering & Technology Jamia Millia Islamia New Delhi-110025

Acknowledgement
We have put considerable effort into this project; however, it would not have been possible without the kind support and help of many individuals and organizations. We would like to extend our sincere thanks to all of them. We are highly indebted to our supervisor and guide, Mr. Tanvir Ahmad, for his motivational and forward-looking ideas, his constant supervision, and for providing the necessary information regarding the project, as well as for his support throughout its course. We would also like to express our gratitude to the faculty members for their kind co-operation and encouragement, which helped us complete this project. Finally, we would like to express our sincere thanks to our family members and our colleagues.

-Shams Abubakar Azad -Yasir Zafar Ansari Department of Computer Engineering Faculty of Engineering & Technology Jamia Millia Islamia

ABSTRACT
Classifying the gender of a blog author generally reveals hidden information about their likes and dislikes, and has important applications in many commercial domains. We report results on stylistic differences in blogging across gender. The results are based on three mutually independent features. The first feature is a word frequency counter. For the second feature, we implemented TF-IDF to calculate the weight of each token and then infer the gender of blog authors from these weights. In the last technique we introduced a POS tagger, which finds the parts of speech in each blog written by the blog authors. These features are augmented with results of previous stylistic-analysis studies reported in the literature. The machine learning experiments were performed on a demographically tagged blog corpus.

TABLE OF CONTENTS

1. INTRODUCTION
   1.1 Data Set
2. RECENT WORK/RELATED WORK
3. PLATFORM, TOOLS AND LANGUAGES
   3.1 NetBeans IDE
   3.2 WEKA
   3.3 JDK
       3.3.1 JDK Contents
       3.3.2 Ambiguity between JDK and SDK
   3.4 Naive Bayes Classifier
       3.4.1 Introduction
       3.4.2 The Naive Bayes probabilistic model
4. FREQUENCY COUNTER
   4.1 How the frequency counter works
   4.2 Why we have used the frequency counter
5. TF-IDF
   5.1 Importance of tf-idf
   5.2 Mathematical formula for tf-idf
   5.3 Steps for tf-idf
6. POS TAGGER
   6.1 Architecture of POS Tagger
   6.2 Design of POS Tagger
7. RESULT AND DISCUSSION
   7.1 Observation and discussion
8. CONCLUSION
9. FUTURE WORK
10. USES
11. REFERENCES

INTRODUCTION
Weblogs, commonly known as blogs, are online personal diaries which generally contain informal writing. With the rapid growth of blogs, their value as an important source of information is increasing. A large amount of research work has been devoted to blogs in the natural language processing (NLP) and other communities. Recent work has also attempted gender classification of blog authors using features such as content words, dictionary-based content analysis results, POS (part-of-speech) tags and feature selection, along with a supervised learning algorithm. There are also many commercial companies that exploit information in blogs to provide value-added services, e.g. blog search, blog topic tracking, and sentiment analysis of people's opinions on products and services. Stylistic classification can improve the results achieved through Information Retrieval (IR) techniques by identifying documents that match a certain demographic profile. Gender is one of the most common demographic features used for experimentation with stylistics, as blogs generally contain this information provided by the author. Style in writing is a result of the subconscious habit of the writer of using one form over a number of available options to present the same thing.

Data Set
Our experiments are based on a real-life blog data set collected from a large number of blog hosting sites. We used the data set collected by Bing Liu and Arjun Mukherjee for their paper "Improving Gender Classification of Blog Authors"; it contains around 3,100 entries. To keep the problem of gender classification of informal text as general as possible, the blog posts were collected from many blog hosting sites and blog search engines, e.g., blogger.com, technorati.com, etc. The gender of an author is determined by visiting the author's profile. Profile pictures or avatars associated with the profile are also helpful in confirming the gender, especially when the gender information is not available explicitly.

RECENT WORK/RELATED WORK


There have been several recent papers on gender classification of blogs (e.g., Schler et al., 2006; Argamon et al., 2007; Yan and Yan, 2006; Nowson et al., 2005). These systems use function/content words, POS tag features, word classes (Schler et al., 2006), content word classes (Argamon et al., 2007), results of dictionary-based content analysis, POS unigrams (Yan and Yan, 2006), and personality types (Nowson et al., 2005) to capture the stylistic behavior of authors' writing for classifying gender. Koppel et al. (2002) also used POS n-grams together with content words on the British National Corpus (BNC). Houvardas and Stamatatos (2006) even applied character (rather than word or tag) n-grams to capture stylistic features for authorship classification of news articles in Reuters. However, these works use only one or a subset of the classes of features; none of them uses all features for classification learning. Given the complexity of blog posts, it makes sense to apply all classes of features jointly in order to classify gender. Research in the last few decades on language use by different social groups was constrained by the unavailability of sufficient annotated data. Analysis of the effects of bloggers' age and gender in weblogs, based on usage of keywords, parts of speech and other grammatical constructs, has been presented in "Learning Age and Gender of Blogger from Stylistic Variation". Age-linked variations have been reported by Pennebaker et al., Pennebaker and Stone, and Burger and Henderson (2006). J. Holmes distinguished characteristics of male and female linguistic styles. Improvements to gender-based and age-based classification are also presented in the paper "Improving Gender Classification of Blog Authors" by Bing Liu and Arjun Mukherjee.

PLATFORM, TOOLS & LANGUAGES


NETBEANS IDE
NetBeans IDE is an open-source integrated development environment. NetBeans IDE supports development of all Java application types (Java SE (including JavaFX), Java ME, web, EJB and mobile applications) out of the box. Among its other features are an Ant-based project system, Maven support, refactoring, and version control.
Modularity: All the functions of the IDE are provided by modules. Each module provides a well-defined function, such as support for the Java language, editing, or support for the CVS and SVN version control systems. NetBeans contains all the modules needed for Java development in a single download, allowing the user to start working immediately. Modules also allow NetBeans to be extended: new features, such as support for other programming languages, can be added by installing additional modules.
NETBEANS PLATFORM
The NetBeans Platform is a reusable framework for simplifying the development of Java Swing desktop applications. The NetBeans IDE bundle for Java SE contains what is needed to start developing NetBeans plugins and NetBeans Platform based applications; no additional SDK is required. Applications can install modules dynamically. Any application can include the Update Center module to allow users of the application to download digitally signed upgrades and new features directly into the running application. Reinstalling an upgrade or a new release does not force users to download the entire application again.

The platform offers reusable services common to desktop applications, allowing developers to focus on the logic specific to their application. Among the features of the platform are:

User interface management (e.g. menus and toolbars)
User settings management
Storage management (saving and loading any kind of data)
Window management
Wizard framework (supports step-by-step dialogs)
NetBeans Visual Library
Integrated development tools

NetBeans IDE is a free, open-source, cross-platform IDE with built-in support for the Java programming language. The NetBeans IDE is written in Java and can run on Windows, Mac OS, Linux, Solaris and other platforms supporting a compatible JVM. A pre-existing JVM or a JDK is not required. The NetBeans platform allows applications to be developed from a set of modular software components called modules. Applications based on the NetBeans platform (including the NetBeans IDE) can be extended by third party developers.

WEKA
Weka (Waikato Environment for Knowledge Analysis) is a popular suite of machine learning software written in Java, developed at the University of Waikato, New Zealand. Weka is free software available under the GNU General Public License.
WHAT IS WEKA?
Weka is a comprehensive suite of Java class libraries that implement many state-of-the-art machine learning and data mining algorithms. WEKA consists of:
Explorer
Experimenter
Knowledge Flow
Simple Command Line Interface
Java interface
Explorer
WEKA's main graphical user interface. Each of the major Weka packages (Filters, Classifiers, Clusterers, Associations, and Attribute Selection) is represented, along with a Visualization tool.
Experimenter
Compares different learning algorithms on different datasets with various parameter settings and analyzes the performance statistics.
Knowledge Flow

The Knowledge Flow provides an alternative to the Explorer as a graphical front end to Weka's core algorithms. The Knowledge Flow is a work in progress so some of the functionality from the Explorer is not yet available. Simple command line interface All implementations of the algorithms have a uniform command-line interface.

Attribute-Relation File Format (ARFF) is the text file format used by Weka to store a data set. A file of this kind is structured as follows (the "tagger" relation):
@relation tagger
@ATTRIBUTE a1 REAL
@ATTRIBUTE a2 REAL
@ATTRIBUTE a3 REAL
@ATTRIBUTE a4 REAL
@ATTRIBUTE class {M, F}
@DATA
0.6615385,0.092307694,0.16923077,0.046153847,M
0.75,0.14772727,0.06818182,0.022727273,M
0.7096774,0.06451613,0.14516129,0.032258064,M
0.7777778,0.10144927,0.053140096,0.019323671,M
0.67877626,0.10325048,0.10133843,0.0114722755,F
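As an illustration of how such an ARFF file can be consumed programmatically, the following is a minimal sketch using Weka's Java API; the file name tagger.arff and the use of 10-fold cross-validation are assumptions for the example, not necessarily how the project invoked Weka.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.Evaluation;
import java.util.Random;

public class ArffDemo {
    public static void main(String[] args) throws Exception {
        // Load the ARFF file shown above (path is an assumption).
        Instances data = DataSource.read("tagger.arff");
        // The class attribute {M, F} is the last attribute.
        data.setClassIndex(data.numAttributes() - 1);

        // Train a naive Bayes classifier on the loaded instances.
        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(data);

        // 10-fold cross-validation just to sanity-check the data set.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(nb, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}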

JDK
The Java Development Kit (JDK) is an Oracle Corporation product aimed at Java developers. Since the introduction of Java, it has been by far the most widely used Java Software Development Kit (SDK). On 17 November 2006, Sun announced that it would be released under the GNU General Public License (GPL), thus making it free software. This happened in large part when, on 8 May 2007, Sun contributed the source code to OpenJDK.

JDK contents
The JDK has as its primary components a collection of programming tools, including:

java - the loader for Java applications. This tool is an interpreter and can interpret the class files generated by the javac compiler. Now a single launcher is used for both development and deployment; the old deployment launcher, jre, no longer comes with the Sun JDK and has been replaced by this new java loader.
javac - the compiler, which converts source code into Java bytecode
appletviewer - this tool can be used to run and debug Java applets without a web browser
apt - the annotation-processing tool
extcheck - a utility which can detect JAR-file conflicts
idlj - the IDL-to-Java compiler. This utility generates Java bindings from a given Java IDL file.
javadoc - the documentation generator, which automatically generates documentation from source code comments
jar - the archiver, which packages related class libraries into a single JAR file. This tool also helps manage JAR files.
javah - the C header and stub generator, used to write native methods
javap - the class file disassembler
javaws - the Java Web Start launcher for JNLP applications
jconsole - Java Monitoring and Management Console
jdb - the debugger
jhat - Java Heap Analysis Tool (experimental)
jinfo - this utility gets configuration information from a running Java process or crash dump (experimental)
jmap - this utility outputs the memory map for Java and can print shared object memory maps or heap memory details of a given process or core dump (experimental)
jps - Java Virtual Machine Process Status Tool; lists the instrumented HotSpot Java Virtual Machines (JVMs) on the target system (experimental)
jrunscript - Java command-line script shell
jstack - utility which prints Java stack traces of Java threads (experimental)
jstat - Java Virtual Machine statistics monitoring tool (experimental)
jstatd - jstat daemon (experimental)
policytool - the policy creation and management tool, which can determine policy for a Java runtime, specifying which permissions are available for code from various sources
VisualVM - visual tool integrating several command-line JDK tools and lightweight performance and memory profiling capabilities
wsimport - generates portable JAX-WS artifacts for invoking a web service
xjc - part of the Java API for XML Binding (JAXB). It accepts an XML schema and generates Java classes.

Ambiguity between JDK and SDK The JDK forms an extended subset of a software development kit (SDK). In the descriptions which accompany its recent releases for Java SE, EE, and ME, Sun acknowledges that under its terminology, the JDK forms the subset of the SDK which has the responsibility for the writing and running of Java programs. The remainder of the SDK comprises extra software, such as application servers, debuggers, and documentation.


Naive Bayes classifier


A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model".

Introduction In simple terms, a naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature, given the class variable. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple. Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a supervised learning setting. In many practical applications, parameter estimation for naive Bayes models uses the method of maximum likelihood; in other words, one can work with the naive Bayes model without believing in Bayesian probability or using any Bayesian methods. In spite of their naive design and apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, analysis of the Bayesian classification problem has shown that there are some theoretical reasons for the apparently unreasonable efficacy of naive Bayes classifiers. Still, a comprehensive comparison with other classification methods in 2006 showed that Bayes classification is outperformed by more current approaches, such as boosted trees or random forests.

An advantage of the naive Bayes classifier is that it only requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined and not the entire covariance matrix.

The naive Bayes probabilistic model Abstractly, the probability model for a classifier is a conditional model

p(C | F1, ..., Fn)

over a dependent class variable C with a small number of outcomes or classes, conditional on several feature variables F1 through Fn. The problem is that if the number of features n is large, or when a feature can take on a large number of values, then basing such a model on probability tables is infeasible. We therefore reformulate the model to make it more tractable. Using Bayes' theorem, we write

p(C | F1, ..., Fn) = p(C) * p(F1, ..., Fn | C) / p(F1, ..., Fn)

In plain English the above equation can be written as

posterior = (prior * likelihood) / evidence

In practice we are only interested in the numerator of that fraction, since the denominator does not depend on C and the values of the features Fi are given, so the denominator is effectively constant.
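To make the decision rule concrete, here is a small stand-alone Java sketch (not taken from the project code) that compares the numerators p(C) * p(F1|C) * ... * p(Fn|C) for the two classes; all probability values are invented purely for illustration.

public class NaiveBayesToy {
    public static void main(String[] args) {
        // Hypothetical, made-up estimates from a training corpus.
        double priorM = 0.5, priorF = 0.5;            // p(C)
        // p(feature = present | class) for two binary features,
        // e.g. "mentions sports" and "uses many pronouns".
        double[] likelihoodM = {0.40, 0.20};
        double[] likelihoodF = {0.15, 0.45};

        boolean[] observed = {true, false};           // feature values of one blog

        // Numerator of Bayes' theorem: p(C) * product of p(Fi | C).
        double scoreM = priorM, scoreF = priorF;
        for (int i = 0; i < observed.length; i++) {
            scoreM *= observed[i] ? likelihoodM[i] : 1 - likelihoodM[i];
            scoreF *= observed[i] ? likelihoodF[i] : 1 - likelihoodF[i];
        }

        // The evidence p(F1,...,Fn) is the same for both classes,
        // so comparing the numerators is enough.
        System.out.println(scoreM >= scoreF ? "M" : "F");
    }
}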


FREQUENCY COUNTER
A word frequency counter is used to count the occurrences of a token in a given file/blog. We used it to count the occurrences of each token, and finally divided each count by the total number of words in that particular blog/file; this gives the frequency of each token. We used ignoreCase() so that case is ignored; for example, 'apple' and 'Apple' are treated as the same token.
How the frequency counter works
Blogs in the form of .txt files are used as input to our program; we read each word of the file until EOF is reached. The code below relies on the TextIO/TextReader helper classes used in the project:

try {
    while (true) {
        // skip any non-letter characters before the next word
        while ( ! in.eof() && ! Character.isLetter(in.peek()) )
            in.getAnyChar();
        if (in.eof())
            break;
        // read the next run of letters and record it
        insertWord(in.getAlpha());
    }
} catch (IOException e) {
    TextIO.putln("An error occurred while reading from the input file.");
    TextIO.putln(e.toString());
    return;
}

After reading each word into a temporary variable we apply ignoreCase() to it, and we use an array to hold the distinct tokens of the file. The temporary variable is compared with each element of the array; if it is already present in the array, the count of that element is increased by one, otherwise the variable is added to the array. After counting the occurrences of each word in the file/blog, we divide each count by the total number of words in that particular file to obtain the frequency of each token. We store the frequencies in decreasing order in a txt file, and from that file we build a weight matrix. The matrix is then used as input to the Weka tool, with naive Bayes and ZeroR as classifiers.
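For readers without the TextIO helper classes, the same counting idea can be expressed with only the Java standard library; the following is a minimal sketch (the input file name blog1.txt and the tab-separated output format are assumptions, not the project's exact code):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class WordFrequency {
    public static void main(String[] args) throws IOException {
        String text = new String(Files.readAllBytes(Paths.get("blog1.txt")));

        // Split on anything that is not a letter and lowercase every token,
        // so 'Apple' and 'apple' are counted together.
        String[] tokens = text.toLowerCase().split("[^a-z]+");

        Map<String, Integer> counts = new HashMap<>();
        int total = 0;
        for (String t : tokens) {
            if (t.isEmpty()) continue;
            counts.merge(t, 1, Integer::sum);
            total++;
        }

        // Relative frequency = count / total number of words in the blog.
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.printf("%s\t%.6f%n", e.getKey(), (double) e.getValue() / total);
        }
    }
}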


Why we have used the frequency counter
A frequency counter on its own is not a sufficient method to find the gender of a blog author, but it still gives a slight indication of it. For example, women generally talk about the household whereas males generally talk about office work; women may have 'pink' as their favourite colour whereas males may have 'blue' as their favourite colour. It therefore makes sense to use a frequency counter, as it gives some information about the kinds of words that a male or a female generally uses. The frequency counter is also simple to use and understand, which is why it may be used as a basic method to find the gender.


Figure 1: Frequency count of words


Figure 2: Code for counting the word frequencies


Figure 3: The output file successfully created


TF-IDF
TF-IDF (Term Frequency, Inverse Document Frequency) is a basic technique to compute the relevancy of a document with respect to a particular term. A "term" is a generalized idea of what a document contains: it can be a word, a phrase, or a concept. Intuitively, the relevancy of a document to a term can be calculated from the percentage of the document that the term makes up (i.e., the count of the term in that document divided by the total number of terms in it). We call this the "term frequency". On the other hand, if this is a very common term which appears in many other documents, then its relevancy should be reduced (i.e., based on the count of documents having this term divided by the total number of documents). We call this the "document frequency". The overall relevancy of a document with respect to a term can be computed using both the term frequency and the document frequency.

relevancy = term frequency * log (1 / document frequency)

This is called tf-idf. A "document" can be considered as a multi-dimensional vector where each dimension represents a term with the tf-idf as its value.


IMPORTANCE OF TF-IDF
However, the main problem with the term-frequency approach is that it scales up frequent terms and scales down rare terms, even though rare terms are empirically more informative than high-frequency ones. The basic intuition is that a term that occurs frequently in many documents is not a good discriminator, and this really makes sense (at least in many experimental tests); the important question here is: why would you, in a classification problem for instance, emphasize a term which is present in almost the entire corpus of your documents?

The tf-idf weight solves this problem. What tf-idf gives is how important a word is to a document in a collection, and that's why tf-idf incorporates local and global parameters: it takes into consideration not only the isolated term but also the term within the document collection. What tf-idf does to solve the problem is to scale down the frequent terms while scaling up the rare terms; a term that occurs 10 times more often than another isn't 10 times more important than it, which is why tf-idf uses a logarithmic scale. But let's go back to our definition of tf(t, d), which is actually the count of the term t in the document d. Using this simple term count could lead to problems like keyword spamming, where a term is repeated in a document with the purpose of improving its ranking in an IR (Information Retrieval) system, or could create a bias towards long documents, making them look more important than they are just because of the high frequency of the term in the document.


To overcome this problem, the term frequency tf(t, d) of a document on a vector space is usually also normalized.

Variations of the tf*idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. tf*idf can be successfully used for stop-words filtering in various subject fields including text summarization and classification.


MATHEMATICAL FORMULA FOR TF-IDF


The term count in the given document is simply the number of times a given term appears in that document. This count is usually normalized to prevent a bias towards longer documents (which may have a higher term count regardless of the actual importance of that term in the document) to give a measure of the importance of the term within the particular document . Thus we have the term frequency tf(t, d).

tf(t, d) = t / TW

where t = the number of times the word appears in the document, and

TW = the total number of words in the document.

The inverse document frequency is a measure of whether the term is common or rare across all documents. It is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient.

idf(D, df) = log(D / df)

where D = the total number of documents, and

df = the number of documents that contain that particular word.

Mathematically the base of the log function does not matter and constitutes a constant multiplicative factor towards the overall result.

Then the tf-idf is calculated as

tf-idf = tf(t, d) * idf(D, df)

tf-idf = (t/TW)*(log(D/df))

A high weight in tf*idf is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms.
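As a concrete illustration of the formulas above, the following stand-alone Java sketch computes tf-idf weights for a tiny invented three-document corpus; the corpus and word choices are made up for the example and are not the project's blog data.

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class TfIdfDemo {
    public static void main(String[] args) {
        // Invented toy corpus: each "document" is already tokenized and lowercased.
        List<String[]> docs = Arrays.asList(
                new String[]{"pink", "dress", "shopping", "dress"},
                new String[]{"office", "work", "meeting", "work"},
                new String[]{"work", "dress", "code"});

        int d = docs.size();  // D = total number of documents

        // df = number of documents containing each word
        Map<String, Integer> df = new HashMap<>();
        for (String[] doc : docs) {
            for (String w : new HashSet<>(Arrays.asList(doc))) {
                df.merge(w, 1, Integer::sum);
            }
        }

        // tf-idf = (t / TW) * log(D / df) for every word of every document
        for (int i = 0; i < d; i++) {
            String[] doc = docs.get(i);
            Map<String, Integer> counts = new HashMap<>();
            for (String w : doc) counts.merge(w, 1, Integer::sum);
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                double tf = (double) e.getValue() / doc.length;
                double idf = Math.log((double) d / df.get(e.getKey()));
                System.out.printf("doc %d  %-10s tf-idf = %.4f%n", i, e.getKey(), tf * idf);
            }
        }
    }
}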


STEPS FOR TF-IDF


The very first step performed in order to find tf-idf is to generate a separate text file for each document and perform STOP-word removal on them. STOP words are words such as 'for', 'a', 'the', 'to', etc., as shown in Fig. 6.

In the next step we perform stemming. For example, if we have words such as 'care' and 'careful', stemming is performed and they are counted as one word.

step1() gets rid of plurals and -ed or -ing, e.g.
caresses -> caress
ponies -> poni
cats -> cat
agreed -> agree
disabled -> disable

step2() turns terminal y to i when there is another vowel in the stem.
step3() maps double suffixes to single ones, so -ization (= -ize plus -ation) maps to -ize, etc.
step4() deals with -ic-, -full, -ness, etc., with a similar strategy to step3().
step5() takes off -ant, -ence, etc., in the context <c>vcvc<v>.
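A heavily simplified sketch of step1()-style suffix stripping is shown below. It only handles plurals, -ed and -ing, and deliberately omits the Porter algorithm's measure and vowel conditions, so a real implementation would use the full Porter stemmer instead:

public class SimpleStemmer {
    // Strips a few common suffixes, roughly in the spirit of Porter's step 1.
    // This is NOT the full Porter algorithm: it ignores the measure/vowel
    // conditions, so it will over- or under-stem some words.
    static String stem(String w) {
        if (w.endsWith("sses")) return w.substring(0, w.length() - 2);   // caresses -> caress
        if (w.endsWith("ies"))  return w.substring(0, w.length() - 2);   // ponies   -> poni
        if (w.endsWith("s") && !w.endsWith("ss"))
            return w.substring(0, w.length() - 1);                        // cats     -> cat
        if (w.endsWith("eed"))  return w.substring(0, w.length() - 1);   // agreed   -> agree
        if (w.endsWith("ed"))   return w.substring(0, w.length() - 2);   // disabled -> disabl
        if (w.endsWith("ing"))  return w.substring(0, w.length() - 3);   // hopping  -> hopp
        return w;
    }

    public static void main(String[] args) {
        for (String w : new String[]{"caresses", "ponies", "cats", "agreed", "disabled"}) {
            System.out.println(w + " -> " + stem(w));
        }
    }
}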


For a set of 100 blogs we found 4523 distinct words. Once the task of stemming is complete, we obtain the raw data, and the number of occurrences of each word is counted for all the documents. tf-idf is then computed as described above, and the weight of every word is calculated for each blog entry as in Fig. 7. A vector matrix is arranged in the .arff format and fed to WEKA for classification, as in Fig. 8. WEKA trains on the first 20% of the data and classifies the remaining 80%, giving the result shown in Fig. 9.
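A small sketch of how such a weight matrix can be written out in the .arff format expected by WEKA is shown below; the relation and attribute names follow the earlier "tagger" example, and the file name tfidf.arff is an assumption.

import java.io.PrintWriter;

public class ArffWriterDemo {
    // weights[i] holds the tf-idf weights of blog i, labels[i] its gender.
    static void writeArff(double[][] weights, String[] labels, String path) throws Exception {
        try (PrintWriter out = new PrintWriter(path)) {
            out.println("@relation tagger");
            for (int j = 0; j < weights[0].length; j++) {
                out.println("@ATTRIBUTE a" + (j + 1) + " REAL");
            }
            out.println("@ATTRIBUTE class {M, F}");
            out.println("@DATA");
            for (int i = 0; i < weights.length; i++) {
                StringBuilder row = new StringBuilder();
                for (double w : weights[i]) row.append(w).append(',');
                row.append(labels[i]);
                out.println(row);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Two made-up rows just to show the output format.
        double[][] weights = {{0.66, 0.09, 0.17}, {0.75, 0.15, 0.07}};
        String[] labels = {"M", "F"};
        writeArff(weights, labels, "tfidf.arff");
    }
}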


Figure 4: Initialise Weight Matrix for tf-idf


Figure 5: Snapshot of making the tf-idf matrix


Figure 6: Stop words


Figure 7: Words and their corresponding weights

Figure 8: .arff file as input for WEKA


Figure 9: Output of .arff file in WEKA


POS TAGGER
Automatic assignment of descriptors to given tokens is called tagging. The descriptor is called a tag. A tag may indicate one of the parts of speech, semantic information, and so on, so tagging is a kind of classification. The process of assigning one of the parts of speech to a given word is called parts-of-speech tagging, commonly referred to as POS tagging. Parts of speech include nouns, verbs, adverbs, adjectives, pronouns, conjunctions and their subcategories.

Example:
Word: Paper, Tag: Noun
Word: Go, Tag: Verb
Word: Famous, Tag: Adjective

Note that some words can have more than one tag associated with them. For example, 'chair' can be a noun or a verb depending on the context.

A parts-of-speech tagger, or POS tagger, is a program that does this job. Taggers use several kinds of information: dictionaries, lexicons, rules, and so on. Dictionaries give the category or categories of a particular word; that is, a word may belong to more than one category. For example, 'run' is both a noun and a verb. Taggers use probabilistic information to resolve this ambiguity.

There are mainly two types of taggers: rule-based and stochastic. Rule-based taggers use hand-written rules to resolve tag ambiguity. Stochastic taggers are either HMM-based, choosing the tag sequence which maximizes the product of word likelihood and tag sequence probability, or cue-based, using decision trees or maximum entropy models to combine probabilistic features.

Ideally a typical tagger should be robust, efficient, accurate, tunable and reusable. In reality, taggers either definitely identify the tag for the given word or make the best guess based on the available information. As natural language is complex, it is sometimes difficult for taggers to make accurate decisions about tags, so occasional tagging errors are not taken as a major roadblock to research. We use a POS tagger to tag each token. The tagger has many types of tags; for simplicity, we only keep track of the most common types: "NN" (noun), "VB" (verb), "JJ" (adjective), "PRP" (personal pronoun), "UH" (interjection), and "RB" (adverb).


ARCHITECTURE OF POS TAGGER


1. Tokenization: The given text is divided into tokens so that they can be used for further analysis. The tokens may be words, punctuation marks, and utterance boundaries.

2. Ambiguity look-up: This uses a lexicon and a guesser for unknown words. While the lexicon provides a list of word forms and their likely parts of speech, the guesser analyzes unknown tokens. Together, the compiler or interpreter, lexicon and guesser make up what is known as the lexical analyzer.

3. Ambiguity Resolution: This is also called disambiguation. Disambiguation is based on information about the word, such as the probability of the word: for example, 'power' is more likely to be used as a noun than as a verb. Disambiguation is also based on contextual information or word/tag sequences: for example, the model might prefer noun analyses over verb analyses if the preceding word is a preposition or article. Disambiguation is the most difficult problem in tagging.


Figure 10: Implementing Stanford POS tagger in Java program



Figure 11: Finding and counting the parts of speech



DESIGN OF POS TAGGER


The Stanford POS Tagger is designed to be used from the command line or programmatically via its API. There is a GUI interface, but it is for demonstration purposes only; most features of the tagger can only be accessed via the command line. To run the demonstration GUI you should be able to use any of the following three methods:
1) java -mx200m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTaggerGUI models/left3words-wsj-0-18.tagger
2) Provided your system gives Java enough memory by default, you can also run it by either double-clicking the stanford-postagger.jar file, or giving the command:
java -jar stanford-postagger.jar
3) Running the appropriate script for your operating system:
stanford-postagger-gui.bat
./stanford-postagger-gui.sh
To run the tagger from the command line, you can start with the provided script appropriate for your operating system:
./stanford-postagger.sh models/left3words-wsj-0-18.tagger sample-input.txt
stanford-postagger models\left3words-wsj-0-18.tagger sample-input.txt
The output should match what is found in sample-output.txt.


The tagger has three modes: tagging, training, and testing. Tagging allows you to use a pre-trained model (two English models are included) to assign part-of-speech tags to unlabeled text. Training allows you to save a new model based on a set of tagged data that you provide. Testing allows you to see how well a tagger performs by tagging labelled data and evaluating the results against the correct tags. Many options are available for training, tagging, and testing. These options can be set using a properties file. To start, you can generate a default properties file with:
java -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -genprops > myPropsFile.prop
This will create the file myPropsFile.prop with descriptions of each tagger option and the default values for these options. Any property you can specify in a properties file can also be specified on the command line, and vice versa. For further information, please consult the Javadocs (start with the entry for MaxentTagger, which includes a table of all options which may be set to configure the tagger and descriptions of those options). To tag a file using the pre-trained bidirectional model:
java -mx300m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/bidirectional-distsim-wsj-0-18.tagger -textFile sample-input.txt > sample-tagged.txt
Tagged output will be printed to standard out, which you can redirect as above.


To train a simple model:
java -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -prop propertiesFile -model modelFile -trainFile trainingFile
To test a model:
java -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -prop propertiesFile -model modelFile -testFile testFile

Figure 12: Screenshot of output of parts of speech


Figure 13: Counting the number of parts of speech for each document


Figure 14: Vector Matrix for Weka


Figure 15: Result of Weka


RESULT AND DISCUSSION


We used all features from the different feature classes, along with our POS patterns, as our pool of features. All results were obtained through a 20% percentage split. This experiment tests the performance of the naive Bayes classifier on different feature types: 20 blogs are chosen from the 100 blogs as the training set, and the remaining 80 blogs are used as the test set. The results of the naive Bayes based classifier are shown in the table below. The columns stand for the different feature types.

Frequency Counter: 64.54%
Tf-Idf: 60.00%
Pos-Tagger: 65.50%

We used ZeroR and Naive Bayes (NB) as learning algorithms. In all our experiments, we used accuracy as the evaluation measure, as the two classes (male and female) are roughly balanced and both classes are equally important.

Method Used    Frequency Counter    Tf-Idf    Pos-Tagger
Naive Bayes    64.54%               60.00%    60.00%
ZeroR          67.23%               62.50%    65.50%

Different methods give different results with different learning algorithms.
Observation and Discussion
ZeroR performed the best, but NB did not do so well. We may also say that Boolean feature values yielded better results than the TF scheme across all classifiers and feature selection methods. Feature selection is very useful: without feature selection (all features), the classifier cannot give a good result. No major changes are brought about by any of the methods used.
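For reference, the following is a minimal sketch of how the 20%/80% percentage split with Naive Bayes and ZeroR can be reproduced through Weka's Java API; the input file name tfidf.arff is an assumption.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.rules.ZeroR;

public class SplitEvaluation {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("tfidf.arff");   // assumed file name
        data.setClassIndex(data.numAttributes() - 1);

        // First 20% of the instances for training, remaining 80% for testing.
        int trainSize = (int) Math.round(data.numInstances() * 0.2);
        Instances train = new Instances(data, 0, trainSize);
        Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);

        for (Classifier c : new Classifier[]{new NaiveBayes(), new ZeroR()}) {
            c.buildClassifier(train);
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(c, test);
            System.out.printf("%s accuracy: %.2f%%%n",
                    c.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}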


CONCLUSION
This experiment studied the problem of gender classification. Although there have been several existing papers studying the problem, the current accuracy is still far from ideal. In this work, we followed the supervised approach and used two techniques to improve on the current state of the art. In particular, we used POS sequence patterns that are able to capture complex stylistic regularities of male and female authors. Since a large number of features have been considered, it is important to find a subset of features that have positive effects on the classification task. Experimental results based on a real-life blog data set demonstrated the effectiveness of these techniques; they help achieve higher accuracy than our baseline configurations. For the blog author gender classification task, we found that the best prediction accuracy we could achieve is 67.23%. This result was achieved using all the features mentioned, the frequency counter as the feature selection criterion, and ZeroR as the classifier. We can say that binary word features are in general more effective than term frequency. Additional features slightly improve prediction accuracy in combination with feature selection mechanisms.

FUTURE WORK
In the future, many different methods such as MI, Chi-square and others can be incorporated with the existing ones to increase the efficiency of the project. Different classification algorithms such as SVM and SVM_R can be used for better results. Increasing the window size of the experiment can also produce more efficient results.

Uses
It can help the user find what topics or products are most talked about by males and females, and what products and services are liked or disliked by men and women. Knowing this information is crucial for market intelligence, because it can be exploited in targeted advertising and also in product development. It has many commercial applications, such as the production of creams, shampoos, etc., based on the needs of males as well as females.

References
1. Bing Liu and Arjun Mukherjee. Improving Gender Classification of Blog Authors. EMNLP 2010.
2. Mayur Rustagi, R. Rajendra Prasath, Sumit Goswami, and Sudeshna Sarkar. Learning Age and Gender of Blogger from Stylistic Variation. LNCS 5909.
3. ICWSM 2009, Spinn3r Dataset (May 2009); Burger, J.D., Henderson, J.C.: An Exploration of Observable Features Related to Blogger Age. In: Proc. of the AAAI Spring Symposium on Computational Approaches to Analyzing Weblogs (2006).
4. Argamon, S., Koppel, M., Pennebaker, J. W., Schler, J. 2007. Mining the Blogosphere: Age, Gender and the Varieties of Self-expression. First Monday, 2007. firstmonday.org


5. Blum, A. and Langley, P. 1997. Selection of Relevant Features and Examples in Machine Learning. Artificial Intelligence, 97(1-2):245-271.
6. Cathy Zhang and Pengyu Zhang. Predicting Gender from Blog Posts. December 10, 2010.
7. Kohavi, R. and John, G. 1997. Wrappers for Feature Subset Selection. Artificial Intelligence, 97(1-2):273-324.
8. http://horicky.blogspot.in/2009/01/solving-tf-idf-using-map-reduce.html
9. http://pyevolve.sourceforge.net/wordpress/?tag=tf-idf
10. http://en.wikipedia.org/wiki/Tf*idf

