
AUTOMATIC PORTAL MANAGEMENT AND

PORTAL-BASED INTELLIGENT SERVICE


DEVELOPMENT
(THAI TOURISM AS A TESTBED)

CS4922 Computer Science Project II


by

Mr. Than Htut Soe

Mr. Ye Myat Thein

20th February 2009

Advisor: Dr. MD. Maruf Hasan

Committee Members: Asst. Prof. Dr. Chakguy Prakasvudhisarn

Asst. Prof. Dr. Chutiporn Anutariya

Bachelor of Science in Computer Science

School of Technology

Shinawatra University
TABLE OF CONTENTS
Acknowledgements 4
Abstract 5
Chapter I
Introduction
a) Background 6
b) Problem statement 6
c) Motivation 6
Chapter II
Literature Review
a) Machine learning and algorithms in brief 7
b) How does KEA operate? 8
c) Vocabulary structure 11
d) Web crawling (Web spidering) 13
e) HTML tags cleaning 14
f) HTTP 14
g) GWT version 1.5 15
Chapter III
Requirements and Design
a) Brief requirements 16
b) System architecture design 17
c) Plan 18
Chapter IV
Implementation
a) Tools and techniques 19
b) Web crawling 19
c) HTML cleaning 20
d) Web crawling and HTML cleaning work flow 20
e) Keyword extraction architecture 22
f) Training mechanism 22
g) Classifier 23
h) Back-end processing database description 24
i) Back-end processing database management in GWT 25
Chapter V
Testing, Evaluation, and Summary
a) Evaluation 28
b) Conclusion 29
c) Future Works 29
d) References 31
Appendix
a) Sample Training File 32
b) Screen Shot of Back End Processing database management 33
c) Entity-Relation diagram for tourism domain 35
d) Entity-Relation diagram for education domain 36

2
LIST OF FIGURES

Fig.1. KEA training and extraction process architecture 10


Fig.2. KEA model building and extracting mechanisms 10
Fig.3. KEA text vocabulary structure 12
Fig.4. Simple Knowledge Organization System vocabulary structure 13
Fig.5. High-level architecture of a standard web crawler 14
Fig.6. Overall system architecture of the system 17
Fig.7. Gantt chart 18
Fig.8. Web crawling system architecture 21
Fig.9. Keywords extraction and classification system architecture 22
Fig.10. Classification block diagram 23
Fig.11. Back-end database schema 25
Fig.12. GWT web interface overview class diagram 27

3
Acknowledgements

First of all, we would like to thank Dr. Maruf Hasan for giving us the suggestions and motivation to work on natural language processing, which we were totally new to. We also thank Asst. Prof. Dr. Chakguy Prakasvudhisarn and Asst. Prof. Dr. Chutiporn Anutariya for supporting us as committee members. Finally, we would like to thank our families and friends for their support and encouragement.

4
Abstract

With the introduction of blogs and other kinds of user-generated content, more and more information about tourist destinations is available on blogs and non-commercial websites. However, these blogs are usually not properly structured, and finding the required information in them is tedious and inefficient. As tourism is a major business in Thailand, it would be very beneficial to keep the information from all these blogs in an easily accessible form. This research focuses on applying the Keyphrase Extraction Algorithm (KEA) with a controlled vocabulary, together with a classification technique, to annotate and classify unstructured web pages, especially those from blogs. For a web administrator striving to keep tourism-related information on his or her website, this technique enhances the ability to screen only the required and up-to-date data efficiently. The technique introduced in this research gives satisfactory results on texts that are written as paragraphs or essays. Applying a similar approach to education-domain documents did not perform well, because documents in that domain normally contain lists of degrees, courses, and tuition fees rather than paragraphs.

5
CHAPTER I

Introduction

a) Background

A web portal, one of the most popular services today, often functions as a point of access to information on the World Wide Web. Portals present information from diverse sources in a unified way. A web portal can never completely cover all the information in its domain, and the problem is how to keep updating the information from these sources in an efficient way. Part of this updating process can be automated by applying the Keyphrase Extraction Algorithm (KEA) and classification. In our project, we are primarily concerned with developing intelligent services for web portals by applying these techniques. Our purpose is to explore and develop an intelligent web portal service using the Thai tourism domain and the education domain as test cases. Documents from the web in these two domains will be preprocessed, assigned appropriate keyphrases, and classified. The approach should also be extensible to other areas with minimum effort.

b) Problem Statement

Tourism is a major business in Thailand. Statistics from the Tourism Authority of Thailand show that in 2007 the total revenue from tourism was 547 billion Baht and the number of tourists was 14.6 million. Because of this huge market, there are many websites for tourism in Thailand covering every kind of service a tourist needs, but they are never complete and up to date. Furthermore, as blogs become more popular, some tourist destinations can be found nowhere else but on blogs. The diversity of these sources of information makes it practically impossible for potential tourists to make use of them effectively.

We will address the problem of presenting information from multiple sources on tourism in a unified way by using natural language processing techniques. Furthermore, intelligent techniques can be applied to provide personalized information that helps people find what they want faster and more easily. Generating content through the collaboration of the users themselves is also a viable option for us, in order to create more useful content and present different opinions.

Furthermore, the technique described in this paper can be applied to different domains with minimal modification. In order to test this hypothesis, we will use the education domain as a second testbed.

c) Motivation

Nowadays, existing services on the web give reasonably complete information and offer attractive interfaces. But a weakness of most services is their failure to cover mid-range and budget tourism facilities. These lower-end facilities cannot be found on commercial websites, but they can be found on some travelers' blogs. Especially in times of economic crisis like the present, most tourists will want to find cheaper facilities on the internet. If we can automate part of the process of finding this information in blogs, it will clearly be an advantage over other commercial tourism websites. A person maintaining such a database does not have to search and browse through thousands of web pages; instead, finding the required information can be automated by using web crawling, keyword extraction, and classification.

6
CHAPTER II

Literature Review

a) Machine learning and algorithms in brief

Initially, we need to define what learning is. Learning can be defined in multiple ways depending on the
context. The dictionary defines ―to learn‖ as follows:

To get knowledge of by study, experience, or being taught;


To become aware by information or from observation;
To commit to memory;
To be informed of, ascertain;
To receive instruction.

The meanings described above do not apply directly when we talk about a machine, in our case the computer. Defining machine learning is therefore not straightforward.

The convergence of computing and communication has produced a society that feeds on information. Yet
most of the information is in its raw form: data. If data is characterized as recorded facts, then information
is the set of patterns, or expectations, that underlie the data. There is a huge amount of information locked
up in databases—information that is potentially important but has not yet been discovered or articulated.

Machine learning is a subfield of Artificial Intelligence concerned with making the computer capable of reasoning, thinking and acting as if it actually understood, by means of algorithms and techniques. The primary focus of machine learning is to derive rules and patterns from data. Machine learning has close connections with data mining, statistics, inductive reasoning, pattern recognition, and theoretical computer science. Machine learning provides the technical basis of data mining. It is used to extract information from the raw data in databases: information that is expressed in a comprehensible form and can be used for a variety of purposes. The process is one of abstraction: taking the data, warts and all, and inferring whatever structure underlies it. The applications of machine learning can be seen in many areas and vary with the objectives of the system.

Humans have an intuitive grasp of data when they see it, and a machine learning system needs a comparable ability. To achieve this, complex mathematical models and strategies are used to imitate this natural human ability. One important point, often misunderstood, is that human intuition cannot be replaced by a machine, no matter how complex and effective the algorithm is; this is widely accepted in the scientific community, at least so far. Some say that machine learning is an attempt to automate parts of the scientific method. Statistical methods such as calculation and probability are core components of machine learning. In truth, we should not look
for a dividing line between machine learning and statistics because there is a continuum—and a
multidimensional one at that—of data analysis techniques. Some derive from the skills taught in standard
statistics courses, and others are more closely associated with the kind of machine learning that has arisen
out of computer science. Historically, the two sides have had rather different traditions. If forced to point
to a single difference of emphasis, it might be that statistics has been more concerned with testing
hypotheses, whereas machine learning has been more concerned with formulating the process of
generalization as a search through possible hypotheses. But this is a gross oversimplification: statistics is
far more than hypothesis testing, and many machine learning techniques do not involve any searching at
all.
7
As mentioned above, machine learning is built on algorithms, which are its cornerstone. Machine learning algorithms are commonly classified as follows:

Supervised learning
Unsupervised learning
Semi-supervised learning
Reinforcement learning
Transduction
Learning to learn

Supervised machine learning is the technique of learning a pattern from training data. The training data contain inputs and the corresponding desired outputs. The primary task of a machine using this learning scheme is to predict the pattern or value on unseen data. In order to do that, it must be able to reason and generalize from the seen training data.

Unsupervised machine learning can be defined simply as the counterpart of supervised learning. It seeks to determine how the data are organized, and it is distinguished from supervised learning (and reinforcement learning) in that the learner is given only unlabeled examples. Unsupervised learning is closely related to the problem of density estimation in statistics; however, it also encompasses many other techniques.

Semi-supervised learning can be seen as a middle ground between the previous two. It uses both labeled and unlabeled examples. Many researchers have found that unlabeled data, when used together with a small amount of labeled data, can produce a considerable improvement in learning accuracy. As a result, this learning scheme has become popular in machine learning.

In reinforcement learning, the algorithm learns a policy of how to act given an observation of the world.
Every action has some impact in the environment, and the environment provides feedback that guides the
learning algorithm. Transduction is similar to supervised learning, but does not explicitly construct a
function: instead, it tries to predict new outputs based on training inputs, training outputs, and test inputs which are available during training. In the final category, learning to learn, the algorithm learns its own inductive bias based on previous experience.

b) How does KEA operate?

In our project, KEA is used for extracting the keywords of documents. Actually, in our case we could call it keyword assignment rather than extraction, for a reason that will be explained later in this section. KEA stands for Keyphrase Extraction Algorithm and was developed by the Digital Libraries and Machine Learning Lab of the Computer Science Department at The University of Waikato, which also developed the WEKA machine learning software package.

Keyphrases provide semantic metadata that summarize and characterize documents. KEA identifies
candidate keyphrases using lexical methods, calculates feature values for each candidate, and uses a
machine-learning algorithm to predict which candidates are good keyphrases. KEA is an algorithm for
extracting keyphrases from text documents. It can be either used for free indexing or for indexing with a
controlled vocabulary. KEA is implemented in Java and is platform independent. It is an open-source
software distributed under the GNU General Public License.

8
The machine learning scheme in KEA first builds a prediction model using training documents with
known keyphrases, and then uses the model to find keyphrases in new documents. A large test corpus is
usually used to evaluate Kea’s effectiveness in terms of how many author-assigned keyphrases are
correctly identified. Keyphrases provide a brief summary of a large document, which is one reason they have become popular. In addition, keyphrases can help users get a feel for the content of
a collection, provide sensible entry points into it, show how queries can be extended, facilitate document
skimming by visually emphasizing important phrases; and offer a powerful means of measuring
document similarity. In the specific domain of keyphrases, there are two fundamentally different
approaches: keyphrase assignment and keyphrase extraction. In our case we will use only keyphrase
extraction. Keyphrase extraction chooses keyphrases from the text itself. It employs lexical and
information retrieval techniques to extract phrases from the document text that are likely to characterize
it. Kea’s extraction algorithm has two stages:

1. Training: create a model for identifying keyphrases, using training documents where the author’s
keyphrases are known.
2. Extraction: choose keyphrases from a new document, using the above model.

In training the system, the steps are input cleaning (ASCII input files are filtered to regularize the text and determine initial phrase boundaries), phrase identification, and case-folding and stemming (all words are case-folded and stemmed using the iterated Lovins method; this means using the classic Lovins stemmer to discard any suffix and repeating the process on the remaining stem until no further change occurs). In the feature-generation stage (generating attributes of the candidate keyphrases), four feature values are used:

TFxIDF is a measure of how specific a term is to the document under consideration, compared to all other documents in the corpus. Candidate phrases with a high TFxIDF value are more likely to be keyphrases.
First occurrence is computed as the percentage of the document preceding the first occurrence of the term. Terms that tend to appear at the start or at the end of a document are more likely to be keyphrases.
Length of a phrase is the number of its component words. Two-word phrases are usually preferred by human indexers.
Node degree of a candidate phrase is the number of phrases in the candidate set that are semantically related to it; this is computed with the help of the thesaurus. Phrases with a high degree are more likely to be keyphrases.

For TFxIDF, Kea builds a document frequency file using a corpus of, for example, about 100 documents. Stemmed candidate phrases are generated from all documents in this corpus using the method described above, and the document frequency file stores each phrase together with a count of the number of documents in which it appears. The formula for TFxIDF is:

TFxIDF = (freq(P, D) / size(D)) x (-log2(df(P) / N))

where freq(P, D) is the number of times P occurs in D, size(D) is the number of words in D, df(P) is the number of documents containing P in the global corpus, and N is the size of the global corpus. The second feature, first occurrence, is calculated as the number of words that precede the phrase's first appearance, divided by the number of words in the document; the result is a number between 0 and 1 that represents how much of the document precedes the phrase's first appearance. The other two features, length and node degree, are primarily used for generating the training set of the system and will be explored in depth in the second part of the project.

To select keyphrases from a new document, Kea determines candidate phrases and feature values, and
then applies the model built during training. The model determines the overall probability that each
candidate is a keyphrase, and then a post-processing operation selects the best set of keyphrases.

9
Fig.1. KEA training and extraction process architecture

The figure above illustrates the conceptual flow of KEA in our case. The global corpus corresponds to the controlled vocabulary, and the vocabulary is used in both the training and the extraction process. The workflow of KEA can be viewed as follows.

Fig.2. KEA model building and extracting mechanisms

Model Builder will use Training Documents and Vocabulary to generate a model with the help of
Stemmer and KEAFilter. The model will later be used to extract the keys from new unseen documents.
10
c) Vocabulary structure

As mentioned above, there are two approaches to this process: free indexing and controlled indexing. In controlled indexing, we have to provide a thesaurus, i.e. a vocabulary. The vocabulary has two formats: plain text and SKOS (Simple Knowledge Organization System). In the plain-text format, three main files must be in the same folder. One file, the EN file, contains all entities or phrases of the particular domain, written as a unique ID with the respective name. The two files whose names end with USE and REL describe relationships among entities: USE states the connection between a non-descriptor term and a descriptor term, while REL describes the connections between entities in a one-to-many fashion. For example, Bangkok, National Palace, Capital of Thailand, and Siam Paragon are entities. Bangkok (a non-descriptor term) can be defined by Capital of Thailand (a descriptor term), and Bangkok is related to the others. The structure of the plain-text vocabulary can be visualized as follows.

Fig.3. KEA text vocabulary structure


In the diagram, A can be described by J, and C can be described by L. A is related to G and I, and C is related to N and P.
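To make this concrete, the three vocabulary files for the Bangkok example above might contain entries roughly like the following; the file names and the exact line syntax shown here are illustrative assumptions, and the precise format expected by KEA should be checked against its documentation:

vocabulary.en   (one entity per line: unique ID and term)
    1    Bangkok
    2    Capital of Thailand
    3    National Palace
    4    Siam Paragon

vocabulary.use  (non-descriptor ID and the descriptor ID that defines it)
    1    2

vocabulary.rel  (entity ID followed by the IDs of its related entities)
    1    3 4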

Another style of vocabulary is SKOS. SKOS is a system for the structured definition of vocabularies, initiated by the W3C. It is a structured way of organizing entities so that the definitions of their relationships are meaningful. The name SKOS was chosen to emphasize the goal of providing a simple yet powerful framework for expressing knowledge organization systems in a machine-understandable way.

SKOS provides a model for expressing the basic structure and content of concept schemes such as thesauri, classification schemes, subject heading lists, taxonomies, folksonomies, and other types of controlled vocabulary. As an application of the Resource Description Framework (RDF), SKOS allows
concepts to be documented, linked and merged with other data, while still being composed, integrated and
published on the World Wide Web.

In basic SKOS, conceptual resources (concepts) can be identified using URIs, labeled with strings in one
or more natural languages, documented with various types of notes, semantically related to each other in
informal hierarchies and association networks, and aggregated into distinct concept schemes.

In advanced SKOS, conceptual resources can be mapped to conceptual resources in other schemes and
grouped into labeled or ordered collections. Concept labels can also be related to each other. Finally, the
SKOS vocabulary itself can be extended to suit the needs of particular communities of practice.

The following figure illustrates the simple structure of SKOS. The figure is referenced from W3C official
documentation.

Fig.4. Simple Knowledge Organization System vocabulary structure

The diagram can be read as follows: the concept Economic cooperation covers cooperative economic activity carried out by many parties; its broader concept is Economic policy, and it is narrowly related to Economic integration, European economic cooperation, European industrial cooperation, and Industrial cooperation.

This kind of structural framework makes the information more meaningful to machines. As a result, machine cooperation and information processing become more effective and efficient.

SKOS has several labeling techniques. Here, labeling means assigning some sort of token to a resource, where the token is intended to denote (label) the resource in natural language discourse and/or in representations intended for human consumption. They are listed below.

Preferred and Alternative Lexical Labels
o These allow you to assign preferred and alternative lexical labels to a resource.
Hidden Lexical Labels
o A hidden label is a lexical label for a resource where you would like that character string to be accessible to applications performing text-based indexing and search operations, but you do not want the label to be otherwise visible.
Multilingual Labeling
o Labels can be given in multiple natural languages.
Symbolic Labeling
o Symbolic labeling means labeling a concept with an image.

d) Web crawling (Web spidering)

Web crawling is the process of creating copies of web pages from the Internet for processing purposes. A program or script that performs web crawling is known as a web crawler. Web crawlers can be used for various purposes, such as indexing for search engines, automating maintenance tasks on a website, or gathering specific types of information from web pages. In this project, an open-source web crawler is configured and used to gather web pages for keyword extraction and classification.

Web crawling starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs to visit, which is called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies set by the user of the web crawler. Many web crawlers are available as free or open-source software. Their behavior can be customized by altering policies, which include the following (a small illustrative sketch of the crawl loop follows the list):

Selection policy that states which pages to download,
Re-visit policy that states when to check for changes to the pages,
Politeness policy that states how to avoid overloading websites, and
Parallelization policy that states how to coordinate distributed web crawlers.
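The sketch below illustrates this loop (seeds, frontier, and crude selection and politeness policies) in Java; it is not the crawler used in this project, which is itsucks (described in Chapter IV), and the link extraction by regular expression is a deliberate simplification:

import java.net.URL;
import java.util.*;
import java.util.regex.*;

public class SimpleCrawler {

    // very rough link extraction; a real crawler would use an HTML parser
    private static final Pattern LINK = Pattern.compile("href=\"(http[^\"]+)\"");

    public static void crawl(List<String> seeds, int maxPages) throws Exception {
        Deque<String> frontier = new ArrayDeque<String>(seeds); // the crawl frontier
        Set<String> visited = new HashSet<String>();

        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) continue;             // skip URLs we have already seen

            String html = new Scanner(new URL(url).openStream(), "UTF-8")
                    .useDelimiter("\\A").next();
            // hand "html" to the HTML cleaner / keyword extraction here

            Matcher m = LINK.matcher(html);
            while (m.find()) {
                String link = m.group(1);
                // a selection policy (depth limit, hostname filter, robots.txt) would go here
                if (!visited.contains(link)) frontier.add(link);
            }
            Thread.sleep(1000);                          // crude politeness policy
        }
    }
}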

13
Fig.5. High-level architecture of a standard web crawler[10]

e) HTML tags cleaning

Web pages contain many formatting tags and scripts that are not suitable for processing by natural language processing and machine learning algorithms. All these tags must be cleaned up so that only the relevant text is extracted. The cleaner used for this project is HtmlCleaner, an open-source HTML parser written in Java. It can reorder individual elements and produces well-formed XML from dirty or ill-formed HTML.

f) HTTP

The Hypertext Transfer Protocol (HTTP) is an application-level protocol for distributed, collaborative,
hypermedia information systems[11]. HTTP has been in use by the World Wide Web global information
initiative since 1990. It is a generic, stateless, protocol which can be used for many tasks beyond its use
for hypertext, such as name servers and distributed object management systems, through extension of its
request methods, error codes and headers. A feature of HTTP is the typing and negotiation of data
representation, allowing systems to be built independently of the data being transferred. The current version of HTTP is HTTP/1.1 (RFC 2616), which supports request pipelining, allowing multiple requests to be sent at the same time so that the server can prepare for the workload and potentially transfer the requested resources more quickly to the client[12]. HTTP defines eight methods (sometimes referred to as "verbs")
indicating the desired action to be performed on the identified resource. These methods are HEAD, GET,
POST, PUT, DELETE, TRACE, OPTIONS, CONNECT. Methods of interest in this project are HEAD
and GET. HEAD asks for the response without the response body. It is useful for retrieving meta-
information written in response headers, without having to transport the entire content. GET requests a
representation of the specified resource.

14
g) GWT version 1.5

The Google Web Toolkit (GWT) allows developers to rapidly develop and debug AJAX applications in the Java language. GWT enables reusable, efficient solutions to recurring Ajax challenges such as asynchronous remote procedure calls, history management, bookmarking, and cross-browser portability. "applicationCreator", a command-line utility shipped with GWT, automatically generates all the files needed to start a GWT project. Any Java development tool can be used, but the utility can create projects specifically for Eclipse. Several open-source plug-ins are available for making GWT development easier within IDEs, e.g. GWT4NB for NetBeans, Cypal Studio for GWT for Eclipse, and gwtDeveloper for JDeveloper. Several third-party libraries for GWT exist, such as Ext GWT, GWT Component Library, GWT-Ext, GWT Widget Library, GWTiger, Rocket GWT, Dojo, and SmartGWT[14].

Another relevant part of the GWT framework in this project is the GWT Remote Procedure Call (RPC) mechanism. This RPC mechanism makes it easy for a GWT application client to make a call to server-side code. It achieves this by hiding all the plumbing necessary to create and consume services[16]. The proxy classes that handle the RPC plumbing, invoking the server-side code and converting data back and forth between the client and server, are all generated automatically. So a service can be implemented simply by writing its interface and its server-side implementation.

GWT supports two modes of executing applications, namely:

Hosted mode runs the application as Java bytecode within the Java Virtual Machine (JVM) in a hosted Apache Tomcat web server process. It is typically used for development and debugging, as it supports hot swapping of code.
Web mode runs the application as pure JavaScript and HTML compiled from the Java source by GWT. This mode is typically used for deployment on any web server of our choice.

GWT provides many libraries that help developers create interactive web applications with ease. Some of the libraries relevant to this project are:

Gears 1.1 Library
Google AJAX Search 1.0 Library
Google Maps 1.0 Library
GWT Web UI class library

The GWT package contains four major components:

GWT Java-to-JavaScript Compiler
GWT Hosted Web Browser (used in hosted mode)
JRE emulation library (JavaScript implementations of the core Java library classes)
GWT Web UI class library (many extensions exist for GWT widgets)[15]

15
CHAPTER III

Requirements and Design

a) Brief requirements

[1.1] The system should accurately display up-to-date information about the places where many tourist attractions are located.

[1.2] When displaying information about these places, the system should be able to clarify the information for a particular place (i.e., it should not give ambiguous or conflicting data).

[1.3] The system should be able to understand natural language (in this case, English) in order to process unstructured data.

[1.4] When the system extracts the required information from the results of crawling the web, it should be able to compare, analyse and decide which data are the most appropriate at the moment.

[1.6] The machine learning components should be implemented in the Java language.

[1.8] The accuracy of the summarizing, extracting, comparing and all other automated functional modules should not be less than 50%.

[1.9] The system should be platform-independent.

[2.0] The system should be interactive.

[2.1] The total time taken by the natural language processing subsystem should be less than 5 minutes.

16
b) System architecture design

Fig.6. Overall system architecture of the system

17
c) Plan

Fig.7. Gantt chart

18
CHAPTER IV

Implementation

a) Tools and techniques

Programming Languages
o Java programming language
o Java Server Pages
Programs
o Google Web Toolkit 1.5.3
o Eclipse IDE
o Java Development Kit 1.6
o Java Run-time Environment 1.6
o MySQL Workbench 5.5
Database
o MySQL
Third-party libraries
o KEA
o HtmlCleaner 2.1
o itsucks 0.3.1
o Spring Application Framework
o dom4j
o jaxen 1.1.1
o log4j 1.2.14

b) Web crawling

Web crawling is necessary in our project in order to download web pages from the internet efficiently. For this purpose, we downloaded and customized an open-source web crawler named itsucks 0.3.1. It has been customized to:

ignore any form of unnecessary content,
not follow external links,
not overload any server, and
respect robots.txt, which specifies pages not to be accessed by web robots.

All these customization options are stored in a single download template file in XML format for a download job. The available customization options in itsucks 0.3.1 are:

Simple rules such as:
o Limitation of link depth
o Limitation of the number of links to follow
o Limitation of time per job
o Allowed hostname filter (regular expression)
o Regular expression filter to save only certain file types/names on disk
Special rules, which currently consist of a single file size filter
Advanced regular expression rules, provided by a highly customizable filter chain that can hold multiple regular expressions; for every expression, actions can be defined for the case where it matches a URL and for the case where it does not. Possible actions are: follow the URL (Accept), do not follow the URL (Reject), or change the priority of the URL
Content filter, which can filter according to the content of text/html files.[13]

Furthermore, the source code of itsucks 0.3.1 has been modified and recompiled in order to avoid re-downloading and re-processing web pages that have already been handled. If a page has not been updated since the last time it was processed, it can be skipped and the existing data can be used. For this purpose, the HTTP headers are requested first and the last-updated date from the response is checked. If the last-updated date of the current page is not later than the stored last-updated date of the same page in the database, that page is not processed. An example of the itsucks download profile used in this project can be found in the appendix.
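The header check can be done with a plain HTTP HEAD request, for example as in the sketch below; this is illustrative code rather than the modified itsucks source, and the stored last-updated timestamp is assumed to come from the database described later:

import java.net.HttpURLConnection;
import java.net.URL;

public class UpdateCheck {

    // Returns true if the page at the given URL reports a Last-Modified date later
    // than the date stored for it in the database (passed in as a millisecond
    // timestamp), i.e. the page should be downloaded and re-processed.
    public static boolean needsProcessing(String url, long storedLastUpdated) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("HEAD");               // headers only, no response body
        long lastModified = conn.getLastModified();  // 0 if the header is missing
        conn.disconnect();

        // If the server does not send Last-Modified, process the page to be safe.
        return lastModified == 0 || lastModified > storedLastUpdated;
    }
}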

c) HTML cleaning

HTML cleaning is achieved by using the HtmlCleaner 2.1 library to strip off unnecessary tags as well as to normalize the ill-formed HTML that is sometimes found in documents on the web. HtmlCleaner's behavior can be configured by setting a number of properties. For example, in order to strip off style and script tags, the code is as follows[17]:

// create the cleaner and take its default properties
HtmlCleaner cleaner = new HtmlCleaner();
CleanerProperties props = cleaner.getProperties();
props.setNamespacesAware(false);
// customize the cleaner's behavior with property setters:
// remove unnecessary tags like script and style
props.setPruneTags("script,style");

After the HTML is cleaned, HtmlCleaner produces an XML document that can be navigated easily to extract data with XPath. In the code snippet shown at the end of the next subsection, an HTML file is cleaned first and then transformed into an XML document; after that, all <p> tag nodes are extracted, to be processed later into the final plain-text data.

d) Web crawling and HTML cleaning work flow

This whole process downloads web pages from URLs and then preprocesses them so that they are suitable for KEA and classification. A list of starting URLs is stored as the base URL list. These are downloaded and parsed for links to other resources. Newly discovered links are filtered according to the rules described in the web crawling section. If a URL satisfies the criteria, it is added to the list of unvisited URLs for further processing.

For the purpose of classification later in the process, the last-updated date, the URL of the page, and the hostname are stored in an XML file with the same file name but with a ".xml" extension. The reason for storing these files is explained in the classifier section.

20
After the web crawler finishes downloading all pages, they are stripped of all HTML tags to extract only the useful text data. The textual data are then stored in the same folder with the same file name but with the extension ".txt". For instance, a file "aWebPage.html" will have a corresponding properties file "aWebPage.xml" and a cleaned text file "aWebPage.txt".

// clean the downloaded HTML file (cleaner and props come from the snippet above)
TagNode node = cleaner.clean(htmlFile);
// transform the cleaned node tree to a JDOM document
Document myJDom = new JDomSerializer(props, true).createJDom(node);
// create an XPath expression to extract all the paragraph tags
XPath x = XPath.newInstance("//p");
List paragraphList = x.selectNodes(myJDom);

Fig.8. Web crawling system architecture

21
e) Keyword extraction architecture

Fig.9. Keywords extraction and classification system architecture

f) Training mechanism

To train the system to extract the relevant keywords, we prepared 28 training documents (4 documents for each place). The contents are mainly from Wikipedia, which we accepted as a knowledge base. The other information sources are also official information providers rather than blogs and comments, since the contents of the training documents should be correct; if we used blogs and forum posts, the training data would be biased towards some users.

The average length of the documents is 346 lines. Keywords are manually assigned; each document has 6 to 9 keywords which can uniquely identify it. The files that contain keywords are named exactly the same as their corresponding full-text documents. For example, if the text file is "aa.txt", then the keywords file should be "aa.key". The training files are plain text files for efficiency of processing. PDF (Portable Document Format) files are also acceptable, but the process converts PDF to text anyway. A sample training file and its keyword file are provided in the appendix.

As mentioned earlier in the literature review, training is the heart of the whole process (the detailed processing of training documents when building the model is described there). Training data are needed for identifying candidate phrases, calculating features and learning in the model-building process, and they are used again in the extraction process to evaluate the performance of the model created. That is, the training documents are also used to test the accuracy of the model, as a seen-data evaluation scheme.

In our project, we demonstrate keyword extraction on two different domains: tourism in Thailand and education. We have already discussed the tourism domain above. In the education domain, our main objective is to detect documents that contain information about university courses and degrees. However, the complete application will not be implemented in this project; only the fundamental framework is considered. Later on, this model can be used to build a system which keeps track of the newest information automatically. In order to do that, we will have to build another component, like the classifier in the tourism domain. That classifier should use regular expressions and other techniques based on the keywords extracted in this project.

g) Classifier

The primary tasks of the classifier in this case are:

To decide which document belongs to which place.
To check whether each document is the latest version or not.
To store the document in the database as plain text if it is the latest.

In carrying out these tasks, the classifier saves only the most recently updated information in the database instead of blindly storing everything.

Fig.10. Classification block diagram

The classifier loads the keywords and the related category (place) from the database into an in-memory HashMap, which it uses throughout the process. Interaction with the database is done through JDBC (Java Database Connectivity). As mentioned in the web crawling section, the resulting files are stored in a directory. Each logical set contains three types of file: the original plain-text file cleaned from HTML, the keywords file containing the keywords assigned by KEA, and the XML file carrying the metadata of the document. From that directory, the classifier first analyzes the keyword files: it checks whether each .key file contains at least one of the keywords held in memory. If a file does not contain any of them, the classifier simply ignores it and moves to the next file; if it does contain one of the predefined keys, the file is added to a temporary list. After this pass, the system loads the plain-text and XML files for each document in the list and checks the latest-updated date for each document. If that date is later than the one already stored in the database, the stored record is replaced; otherwise nothing is done. The documentation of the classifier is provided in the appendix.
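A condensed sketch of this classification logic in Java is shown below; the class and helper names are our own illustrations, not the actual classifier code, and the XML parsing and JDBC details are stubbed out:

import java.io.File;
import java.util.Map;
import java.util.Scanner;

public class ClassifierSketch {

    // Keeps only documents that carry at least one known keyword and are
    // newer than the copy already stored in the database.
    public static void classify(File dir, Map<String, Integer> keywordToPlace) throws Exception {
        for (File keyFile : dir.listFiles()) {
            if (!keyFile.getName().endsWith(".key")) continue;

            Integer placeId = null;
            Scanner in = new Scanner(keyFile);
            while (in.hasNextLine() && placeId == null) {
                placeId = keywordToPlace.get(in.nextLine().trim().toLowerCase());
            }
            in.close();
            if (placeId == null) continue;              // no known keyword: ignore the file

            String base = keyFile.getName().replace(".key", "");
            long lastUpdated = lastUpdatedFromXml(new File(dir, base + ".xml"));
            String text = new Scanner(new File(dir, base + ".txt")).useDelimiter("\\A").next();

            // replace the stored row only if this copy is newer than the database copy
            if (lastUpdated > storedLastUpdated(placeId, base)) {
                storePlaintext(placeId, base, lastUpdated, text);
            }
        }
    }

    // The helpers below stand in for the XML parsing and JDBC code.
    static long lastUpdatedFromXml(File xml) { return 0L; }
    static long storedLastUpdated(int placeId, String file) { return 0L; }
    static void storePlaintext(int placeId, String file, long updated, String text) { }
}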

23
h) Back-end processing database description

We have made a database at the back-end of the system. This database is for storing stage in the abstract
view of our application and the database is quite small since we do not want the overhead of database
design concerns. As a matter of fact, we can use simple plain text storing of the documents. However, this
is the old approach and has a lot of weaknesses. The obvious concerns are data storing redundancy, data
consistency, data integrity and so on. If we cannot make the system to be aware of those concerns, the
unnecessary allocation of resources at the server side will happen. That is the primary reason of making
small and efficient database.

The database consists of 5 tables:

1. base_url
This stores the URLs of interest to the administrator of the system. The administrator can enter any website from which he or she wants to extract information via the web interface. This is a flexible feature of our system, instead of hard-coding fixed URLs in the code.
2. data_class
This serves as a type description for the plain-text data. It currently consists of a controlled status and a suspected status. The former marks plain-text data that were stored using the controlled vocabulary; the latter is for documents stored using the free-indexing approach. This relation will become important if, in the future, we split the data class into a 1st-order suspicion (documents containing first-priority keywords), a 2nd-order suspicion (documents containing second-priority keywords), and so on.
3. keywords
This stores the keywords and the related place name. It can accept more keywords from the administrator. The data in this relation are loaded into memory when the classifier processes the documents. The attributes are the id of the place, referenced from the place relation, and the keyword.
4. place
This table stores the names of the places and their unique IDs. Each row is treated as a category, or place. The attributes are the id and the place name.
5. plaintext_data
All the documents processed by the classifier are stored here. The first three attributes together serve as the primary key. The attributes are the place ID referenced from the place relation, the file name, the status of the document, the original URL, the domain name, the last-updated date, and the plain-text data.

The relationships between the tables are shown in the ER diagram below.

24
Fig.11. Back-end database schema
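As a rough illustration of how such tables can be created through JDBC, the sketch below defines two of them; the column names and types are guesses based on the descriptions above, not the exact schema of Fig.11, and the connection URL and credentials are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SchemaSketch {
    public static void main(String[] args) throws Exception {
        // connection details are placeholders only
        Connection con = DriverManager.getConnection("jdbc:mysql://localhost/portal", "user", "password");
        Statement st = con.createStatement();

        st.executeUpdate("CREATE TABLE place ("
                + " id INT PRIMARY KEY,"
                + " name VARCHAR(100))");

        st.executeUpdate("CREATE TABLE keywords ("
                + " place_id INT,"
                + " keyword VARCHAR(100),"
                + " FOREIGN KEY (place_id) REFERENCES place(id))");

        st.close();
        con.close();
    }
}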

i) Back-end processing database management in GWT

The back-end database user interface is implemented using Google Web Toolkit 1.5.3. Using GWT results in an interactive AJAX web application. Furthermore, the resulting JavaScript is optimized, obfuscated, and compatible with both Internet Explorer and Mozilla. The client GUI is written using GWT widgets, which are translatable to JavaScript; some widgets are styled with CSS. Client-side classes are in the package "org.ths.client" and server-side classes are in "org.ths.server". The "org.ths.util" package contains utility classes for tasks such as manipulating the database through JDBC and converting between non-serializable and serializable classes.

GWT has its own RPC framework for building RPC services. The first step in defining a service is its interface. In GWT, an RPC service is defined by an interface that extends the "RemoteService" interface; in our project that interface is named "KeyWordService". It defines the methods that the service is going to provide. The parameters and return types of the methods must be serializable so that RPC can handle parameter passing and returning.

The service implementation contains the code that executes when the service is invoked through one of its methods. Since the service implementation lives on the server, it is not translated to JavaScript as the client-side code is. The service runs as Java bytecode, which means it has access to the full Java platform libraries and any other third-party components you may want to integrate. "KeyWordServiceImpl" extends the "RemoteServiceServlet" class and contains code that uses a JDBC utility class to provide the services.

All RPC calls in GWT are asynchronous, which means they do not block while waiting for the call to return; the code following the call executes immediately. When the call completes, the code within a callback method is executed. The callback object must contain two methods: onFailure(Throwable) and onSuccess(T). You specify your callback methods by passing an AsyncCallback object to the service proxy class when you invoke one of the service's methods. When the server call completes, one of these two methods is called, depending on whether the call succeeded or failed.

It is also necessary to create a new interface with method definitions that add an AsyncCallback parameter to each of the service methods. This interface is very similar to the original service interface: it must have the same name with "Async" appended, it must be located in the same package, and each method must have the same name and signature as in the service interface, except with no return type and an AsyncCallback object as the last parameter. In our project, "KeyWordServiceAsync" specifies these asynchronous service calls.
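For concreteness, the pair of interfaces and the implementation class described above might look roughly like the following sketch; only the getPlaces method, which appears in the client code below, is shown, and its body is a placeholder:

// KeyWordService.java -- synchronous service interface, implemented on the server
public interface KeyWordService extends com.google.gwt.user.client.rpc.RemoteService {
    String[] getPlaces();
}

// KeyWordServiceAsync.java -- client-side counterpart: same methods, no return type,
// AsyncCallback as the last parameter
public interface KeyWordServiceAsync {
    void getPlaces(com.google.gwt.user.client.rpc.AsyncCallback<String[]> callback);
}

// KeyWordServiceImpl.java -- server-side implementation running as a servlet
public class KeyWordServiceImpl extends com.google.gwt.user.server.rpc.RemoteServiceServlet
        implements KeyWordService {
    public String[] getPlaces() {
        // would query the place table through the JDBC utility class
        return new String[] { "Bangkok", "Chiang Mai" };
    }
}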

After the service is implemented as described above, it can be called from the client. The first step is to create the service proxy class with GWT.create(Class). Then the callback is set up as a new AsyncCallback instance with the two methods onSuccess(T) and onFailure(Throwable). Finally, the asynchronous call is made by passing the required parameters and the callback instance. The code that does this in our project is shown below.

private final KeyWordServiceAsync keywordServ = GWT.create(KeyWordService.class);

ServiceDefTarget endpoint = (ServiceDefTarget) keywordServ;
endpoint.setServiceEntryPoint("/keywordService"); // set the relative location of the service

AsyncCallback<String[]> callback = new AsyncCallback<String[]>() {
    public void onFailure(Throwable caught) {
        // do something with errors
    }

    public void onSuccess(String[] result) {
        for (int i = 0; i < result.length; i++) {
            // code hidden
            // display the resulting places string array on the user interface
        }
    }
};

// make the asynchronous call with the callback that has been created
keywordServ.getPlaces(callback);

26
[Fig.12 shows the client side (translatable Java code that runs as JavaScript): the ServiceDefTarget and KeyWordServiceAsync interfaces and the automatically generated KeyWordServiceProxy; and the server side (Java code running as bytecode on Apache Tomcat in this case): RemoteServiceServlet, the RemoteService and KeyWordService interfaces, and KeyWordServiceImpl, which we wrote. The diagram distinguishes GWT framework classes, automatically generated classes, and classes written by us.]

Fig.12. GWT web interface overview class diagram

27
CHAPTER V

Testing, Evaluation, and Summary

a) Evaluation

In order to measure the accuracy of the keyword extraction module, we divided the evaluation into two schemes: one tests on the training files and the other tests on unseen documents. Both schemes extract keywords using the model built from the training files. The training data set has 28 files, with 4 documents for each place. The measures are precision, recall, and F-measure, as usual in statistics. The following matrix and formulae are used to compute the measures.

                   Keywords    Not keywords
Extracted              a            b
Not extracted          c            d

Precision = a/(a+b)

Recall = a/(a+c)

F-Measure = (2*(Precision*Recall))/(Precision+Recall)

In the first scheme, the built model is used to extract keywords from the training documents which were previously used to build the model itself. Precision for each file is calculated as the ratio of the number of correctly identified keys to the total number of keywords extracted. Recall is the ratio of the number of correctly identified keys to the number of keywords that should have been extracted. Precision in this case is 0.822, recall is 0.841 and the F-measure is 0.831. The data involved and the calculated results are shown below.

File names Correct keys Total keys extracted Keys that should be extracted
Ay1 2 2 3
Ay2 2 4 3
Ay3 1 1 2
Bkk1 3 3 3
Bkk2 1 1 1
Bkk3 3 4 3
Cm1 5 5 5
Cm2 2 2 2
Cm3 2 4 2
Cr1 2 4 3
Cr2 2 2 2
Cr3 1 1 2
Ks1 1 1 1
Ks2 1 1 1
Ks3 1 1 1
Pty1 2 2 2
Pty2 2 2 2
Pty3 1 2 1
Phu1 1 1 1
Phu2 2 2 2
Phu3 0 0 2
TOTAL 37 45 44
Precision Recall F-Measure
0.822 0.841 0.831391461
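As a quick check on the totals above: precision = 37/45 ≈ 0.822, recall = 37/44 ≈ 0.841, and F-measure = (2 × 0.822 × 0.841) / (0.822 + 0.841) ≈ 0.831, which matches the reported values.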

28
In the second scheme, the built model is used to extract keywords from test documents which are completely new to the model. Precision, recall and F-measure are calculated as before. Precision in this case is 0.68, recall is 0.889 and the F-measure is 0.771. The data involved and the calculated results are shown below.

File names Correct keys Total keys extracted Keys that should be extracted
833010001 1 2 1
Ayutthaya2 1 4 2
Ayutthaya 1 2 1
Bangkok Pundit 1 1 3
bkk 1 1 1
bkk_file01121 1 1 1
Blog Pattaya, expat Pattaya 1 1 1
chiang_mai-thailand-travel-blogs-d641667-3 4 5 4
ChiangMai 4 1 2 2
ChiangRai 4 1 1 1
Life in Chiang Mai blog 3 5 3
Life in Pattaya blog 4 5 4
tpod 4 5 3
TOTAL 24 35 27

Precision Recall F-Measure


0.68 0.889 0.770579987

b) Conclusion

The technique described in this paper, combining web crawling, preprocessing, keyword extraction and classification, gave satisfactory results for our tourism domain. The keyphrase extraction mechanism in KEA takes into consideration the TFxIDF, node degree and discretization of candidate keyphrases. Consequently, KEA is only able to extract correct keyphrases from documents which are written like essays or paragraphs.

KEA does not perform well at extracting keyphrases from documents in the education domain. Most of the texts in that domain are not sentences but lists of courses, degrees and fees. That means KEA's source code would have to be modified in order to assign keyphrases to education-domain documents effectively. One possible way to do so is to include heuristics such as regular expressions in the keyphrase assignment. For example, a phrase that starts with "Bachelor of", "Master of" or "Doctor of" is very likely to be a degree title keyphrase.
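A minimal sketch of such a heuristic in Java might look like the following; the pattern is only an example of the idea, not a rule implemented or tested in this project:

import java.util.regex.Pattern;

public class DegreeHeuristic {

    // Matches phrases that start with a typical degree prefix,
    // e.g. "Bachelor of Science in Computer Science".
    private static final Pattern DEGREE =
            Pattern.compile("^(Bachelor|Master|Doctor) of .+", Pattern.CASE_INSENSITIVE);

    // Returns true if the candidate phrase looks like a degree title.
    public static boolean looksLikeDegreeTitle(String phrase) {
        return DEGREE.matcher(phrase.trim()).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeDegreeTitle("Bachelor of Science in Computer Science")); // true
        System.out.println(looksLikeDegreeTitle("Tuition fee: 45,000 Baht per semester"));   // false
    }
}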

c) Future works

The keywords extracted for education-domain documents can currently identify whether a document actually belongs to the education domain or not. These keywords can later be used to develop components that extract all the courses and subjects offered by a university, probably with heuristics such as the regular expressions mentioned in the conclusion. In this way, up-to-date information about university courses can be gathered automatically. If that is achieved, we can build an education portal offering opportunities for both service providers and students. An example of an education portal database is provided in the appendix.

The system for the tourism domain will be integrated with a tourism website which has its own database of information about tourism in Thailand and provides other related services such as air ticket search, hotels and accommodation, and so on. The draft ER diagram of this plan can be seen in the appendix. Database updates can be made either manually or automatically, using Named-Entity Recognition and other techniques.

29
The system can be tuned to switch to another domain by changing the vocabulary and training documents, so it will be useful for any type of domain that should have up-to-date information.

30
REFERENCES

1. Witten, I. H. and Frank, E., Data Mining: Practical Machine Learning Tools and Techniques, 2nd Edition, Elsevier.
2. http://en.wikipedia.org/wiki/Machine_learning
3. Alpaydın, E. (2004). Introduction to Machine Learning (Adaptive Computation and Machine Learning), MIT Press, ISBN 0262012111.
4. Witten, I., Paynter, G., Frank, E., Gutwin, C., and Nevill-Manning, C., KEA: Practical Automatic Keyphrase Extraction.
5. Turney, P., Learning to extract keyphrases from text, Information Retrieval, 1999.
6. Lovins, J. B., Development of a stemming algorithm, Mechanical Translation and Computational Linguistics, 11, 22-31, 1968.
7. KEA description, http://www.nzdl.org/Kea/description.html (accessed Sep 28).
8. Domingos, P. and Pazzani, M., On the optimality of the simple Bayesian classifier under zero-one loss, Machine Learning, 29 (2/3), 103-130, 1997.
9. http://www.w3.org/TR/2005/WD-swbp-skos-core-guide-20051102/
10. http://en.wikipedia.org/wiki/Web_crawler
11. http://www.ietf.org/internet-drafts/draft-ietf-httpbis-p1-messaging-05.txt
12. http://www.ietf.org/rfc/rfc2616.txt
13. http://itsucks.sourceforge.net/about.php
14. http://en.wikipedia.org/wiki/Google_Web_Toolkit
15. Burnette, E. (2007). Google Web Toolkit: Taking the Pain out of Ajax, The Pragmatic Bookshelf.
16. Perry, B. W. (2007). Google Web Toolkit for Ajax, O'Reilly Short Cuts, O'Reilly.
17. http://htmlcleaner.sourceforge.net/

31
Appendix:

a) Sample Training File

One of the training files, for Ayutthaya, is as follows.

The manually defined keywords from the document above are in the following file.

32
b) Screen Shot of Back End processing database management

1. Adding keywords for controlled vocabulary

2. Browsing and Searching keywords

33
3. Browsing text data by last updated date

4. Adding base urls

34
c) Entity-Relation diagram for tourism domain

35
d) Entity-Relation diagram for education domain

36
