Identifying Syntax and Semantic Relations between Articles

Chapter 1

INTRODUCTION
The Web has undergone exponential growth since its birth, and this expansion has
generated a number of problems; in this paper we address two of these: 1. The proliferation of
documents that are identical or almost identical. 2. The instability of URLs. The basis of our
approach is a mechanism for discovering when two documents are "roughly the same"; that is,
for discovering when they have the same content except for modifications such as formatting,
minor corrections, webmaster signature, or logo. Similarly, we can discover when a document is
"roughly contained" in another. Applying this mechanism to the entire collection of documents
found by the AltaVista spider yields a grouping of the documents into clusters of closely related
items. As explained below, this clustering can help solve the problems of document duplication
and URL instability. The duplication problem arises in two ways. First, there are documents found in multiple places in identical form; examples include FAQ (Frequently Asked Questions) and RFC (Request For Comments) documents, the online documentation for popular programs, documents stored in several mirror sites, and legal documents. Second, there are documents found in almost identical incarnations because they are:
1) Different versions of the same document.
2) The same document with different formatting.
3) The same document with site specific links, customizations or contact information.
4) Combined with other source material to form a larger document.

The instability problem arises when a particular URL becomes undesirable because:
1) The associated document is temporarily unavailable or has moved.
2) The URL refers to an old version and the user wants the current version.
3) The URL is slow to access and the user wants an identical or similar document that will be faster to retrieve.
In all these cases, the ability to find documents that are syntactically similar to a given document allows the user to find other, acceptable versions of the desired item. URNs (Uniform Resource Names) have often been
suggested as a way to provide functionality similar to that outlined above. URNs are a
generalized form of URLs (Uniform Resource Locators). However, instead of naming a resource
directly - as URLs do by giving a specific server, port and file name for the resource - URNs
point to the resource indirectly through a name server. The name server is able to translate the
URN to the "best" (based on some criteria) URL of the resource. The main advantage of URNs is
that they are location independent. A single, stable URN can track a resource as it is renamed or
moves from server to server. A URN could direct a user to the instance of a replicated resource
that is in the nearest mirror site, or is given in a desired language. Unfortunately, progress towards URNs has been slow. The mechanism we present here provides an alternative solution.

Identical documents do not need to be handled specially in our algorithm, but they add to
the computational workload and can be eliminated quite easily. Identical documents obviously
share the same set of shingles and so, for the clustering algorithm, we only need to keep one
representative from each group of identical documents. Therefore, for each document we
generate a fingerprint that covers its entire contents. When we find documents with identical
fingerprints, we eliminate all but one from the clustering algorithm. After the clustering has been
completed, the other identical documents are added into the cluster containing the one kept
version. We can expand the collection of identical documents with the "lexically-equivalent"
documents and the "shingle-equivalent" documents. The lexically-equivalent documents are
identical after they have been converted to canonical form. The shingle-equivalent documents are
documents that have identical shingle values after the set of shingles has been selected.
Obviously, all identical documents are lexically-equivalent, and all lexically-equivalent documents are shingle-equivalent. We can find each set of documents with a single fingerprint. Identical documents are found with the fingerprint of the entire original contents. Lexically-equivalent documents are found with the fingerprint of the entire canonicalized contents. Shingle-equivalent documents are found with the fingerprint of the set of selected shingles.
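The following is a minimal sketch, in Java, of the shingling and fingerprinting idea described above. The window size of 4 and the use of String.hashCode() are illustrative assumptions; the actual system uses stronger fingerprints (e.g., Rabin fingerprints) over canonicalized text.

import java.util.HashSet;
import java.util.Set;

public class Shingler {

    // Slide a window of w tokens over the document and collect each
    // w-gram as a shingle.
    static Set<String> shingles(String[] tokens, int w) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i + w <= tokens.length; i++) {
            StringBuilder sb = new StringBuilder();
            for (int j = i; j < i + w; j++) {
                if (j > i) sb.append(' ');
                sb.append(tokens[j]);
            }
            result.add(sb.toString());
        }
        return result;
    }

    // One fingerprint per shingle set: shingle-equivalent documents map to
    // the same value, so only one representative enters the clustering.
    static int fingerprint(Set<String> shingles) {
        int h = 0;
        for (String s : shingles) h ^= s.hashCode(); // order-independent
        return h;
    }

    public static void main(String[] args) {
        String[] doc = "a rose is a rose is a rose".split(" ");
        Set<String> sh = shingles(doc, 4);
        System.out.println(sh);              // the distinct 4-shingles
        System.out.println(fingerprint(sh)); // fingerprint of the set
    }
}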

Objectives of the Project: Modules of the Project


[1] Design and Development of the Article Submission Algorithm, which is used to submit the articles.
[2] Design and Development of the Data Cleaning Algorithm, which is used to remove the unwanted data known as stop words.
[3] Design and Development of the Tokenization Algorithm, which is used to obtain the tokens in a text document.
[4] Design and Development of the Syntactic Relation Algorithm, to find the syntactic relations between documents.
[5] Design and Development of the Semantic Relation Algorithm, to find the semantic relations between documents.
[6] Design and Development of the Score Computation Algorithm, used to compute the scores of the documents.
[7] Design and Development of the Similarity between Articles Algorithm.

Methodology

[Figure: stages of the relation algorithm: Article Submission → Data Cleaning → Token Determination → Syntactic Relation → Semantic Relation → Score Computation → Similarity Measure]

Fig: Stages of the Relation Algorithm


The following goals are defined:
1. Article Submission: This module is responsible for the storage of articles.
2. Data Cleaning: This module is used to remove stop words from the article.
3. Tokenization: This process is used to obtain all the keywords of the article and assign each a unique ID as well as the web site ID (see the sketch after this list).
4. Syntactic Relation: This module is responsible for finding the syntactic relations of articles, i.e., the verbs, adverbs and adjectives of the articles.
5. Semantic Relation: This module is used to find the various semantic relations, i.e., hypernymy.
6. Score Computation: This is used to measure the score with respect to the syntactic and semantic relations.
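As a concrete illustration of the tokenization module, the sketch below assigns a unique sequential ID to each distinct keyword of an article. The ID scheme and names are assumptions made for illustration; in the project each token is additionally tagged with the article/web site ID.

import java.util.LinkedHashMap;
import java.util.Map;

public class Tokenizer {
    public static void main(String[] args) {
        String article = "semantic relations enrich the domain ontology";
        Map<String, Integer> tokenIds = new LinkedHashMap<>();
        int nextId = 1;
        for (String token : article.toLowerCase().split("\\s+")) {
            // assign an ID only the first time a keyword is seen
            if (!tokenIds.containsKey(token)) {
                tokenIds.put(token, nextId++);
            }
        }
        System.out.println(tokenIds);
        // {semantic=1, relations=2, enrich=3, the=4, domain=5, ontology=6}
    }
}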


Problem Definition
Semantic and syntactic relations have played an important role in many applications in recent years, especially in the Semantic Web, Information Retrieval, Information Extraction, and Question Answering. Semantic and syntactic relations carry the main ideas of sentences and paragraphs. This project presents our proposed algorithms for identifying semantic and syntactic relations between objects and their properties in order to enrich a domain-specific ontology, namely a Computing Domain Ontology, which is used in an Information Extraction system.

Previous Approach

Disadvantages of Previous Approach

Proposed Approach
The proposed approach is to automatically identify the syntactic and semantic relations that might be found in the text documents of articles of a specific domain. Afterward, we extract these relations in order to enrich a domain-specific ontology. This ontology can be used in many applications, such as Information Retrieval, Information Extraction, and Question Answering focusing on the computing domain. For this purpose, we propose a methodology which combines Natural Language Processing (NLP) and Machine Learning.

Methodology


[Figure: methodology of the project: Article Submission → View Articles → Data Cleaning (stop word analysis) → Tokenization → Identify Syntax Relations (find verb, adverb, noun) → Identify Semantic Relations (synonyms, hyponyms, hypernyms of instance data) → find whether the articles are similar based on syntax and semantics]

Fig: Methodology of the Project


The figure shows the methodology of the project.

Article Submission
The Article Submission module is used for submitting an article with its article name and article description.

View Articles
This module is responsible for viewing the articles.

Data Cleaning
This module is responsible for preprocessing and cleaning the text data. The module makes use of stop words in order to perform the analysis and do the cleaning. Data Cleaning removes the stop words from each of the articles and cleans them. After the data cleaning process is completed, the clean data can be represented as a set.

Stopwords

Stop words are a set of words that do not carry any specific meaning on their own. Stop words are words which are filtered out before or after the processing of natural language data (text). There is no single definitive list of stop words that all tools use, and such a filter is not always applied. The list of stop words used in the algorithm is as follows (a short cleaning sketch in Java follows the list):
a, able, about, across, after, all, almost, also, am, among, an, and, any, are, as, at, be, because, been, but, by, can, cannot, could, dear, did, do, does, either, else, ever, every, for, from, get, got, had, has, have, he, her, hers, him, his, how, however, i, if, in, into, is, it, its, just, least, let, like, likely, may, me, might, most, must, my, neither, no, nor, not, of, off, often, on, only, or, other, our, own, rather, said, say, says, she, should, since, so, some, than, that, the, their, them, then, there, these, they, this, tis, to, too, twas, us, wants, was, we, were, what, when, where, which, while, who, whom, why, will, with, would, yet, you, your
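The sketch below shows the data-cleaning step using this list (abbreviated here to a few entries for readability): tokens found in the stop-word set are filtered out, and the remaining clean tokens are passed on to tokenization. The class and method names are illustrative assumptions.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DataCleaner {
    // abbreviated stop-word set; the full list is given above
    private static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("a", "an", "and", "the", "is", "of", "to", "in"));

    static List<String> clean(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!STOP_WORDS.contains(token)) kept.add(token); // drop stop words
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(clean("The syntax of an article is compared"));
        // prints: [syntax, article, compared]
    }
}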

Syntax Analyzer

Syntactic relations are the relations between concepts or words in a sentence with respect to verbs and adverbs.

Identifying the Semantic Relations

As mentioned above, the sentence layer also includes sentences that are derived from synonyms, hyponyms and hypernyms of instances of the ingredient layer. We use WordNet to find the set of synonyms, hyponyms and hypernyms of instances from the ingredient layer. WordNet is an ontology that covers many different domains; however, we only focus on the computing domain.
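The sketch below illustrates the semantic-relation lookup. The project queries WordNet for synonyms, hyponyms and hypernyms; to keep the example self-contained, a small hand-built map stands in for the WordNet API, and its entries are illustrative assumptions rather than WordNet output.

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SemanticLookup {
    // stand-in lexicon: word -> related words (synonyms/hypernyms)
    private static final Map<String, Set<String>> RELATED = new HashMap<>();
    static {
        RELATED.put("computer", new HashSet<>(Arrays.asList("machine", "workstation", "device")));
        RELATED.put("algorithm", new HashSet<>(Arrays.asList("procedure", "method", "rule")));
    }

    static boolean semanticallyRelated(String a, String b) {
        Set<String> rel = RELATED.get(a);
        return rel != null && rel.contains(b);
    }

    public static void main(String[] args) {
        System.out.println(semanticallyRelated("computer", "machine")); // true
        System.out.println(semanticallyRelated("computer", "rule"));    // false
    }
}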

Articles are Similar Based on Syntax and Semantics

For the articles, the syntax relations such as verb and adverb are found first, and then the semantic relations are found based on synonyms, hyponyms and hypernyms. Once the syntactic and semantic values are computed, if the resulting similarity value is greater than 50%, the two articles are considered the same.
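A minimal sketch of this decision rule follows. The 50% threshold comes from the text above; the equal weighting of the syntactic and semantic scores is an assumption made for illustration.

public class SimilarityDecision {
    static boolean areSimilar(double syntaxScore, double semanticScore) {
        // assumed equal weighting of the two scores
        double combined = (syntaxScore + semanticScore) / 2.0;
        return combined > 0.5; // the "greater than 50%" rule
    }

    public static void main(String[] args) {
        System.out.println(areSimilar(0.7, 0.4)); // true  (combined 0.55)
        System.out.println(areSimilar(0.3, 0.4)); // false (combined 0.35)
    }
}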


Chapter 2
LITERATURE SURVEY
In the paper [1], titled "Syntactic clustering of the Web", the authors developed an efficient way to determine the syntactic similarity of files and applied it to every document on the World Wide Web. Using this mechanism, they built a clustering of all the documents that are syntactically similar. Possible applications include a "Lost and Found" service, filtering the results of Web searches, updating widely distributed web pages, and identifying violations of intellectual property rights.
In the paper [2], titled "Efficient near-duplicate detection for Q&A forum", the authors address the issue of redundant data in large-scale collections of Q&A forums. They propose and evaluate a novel algorithm for automatically detecting near-duplicate Q&A threads. The main idea is to use a distributed index and the MapReduce framework to calculate pairwise similarity and identify redundant data in a fast and scalable way. The authors present two distributed inverted-index methods to calculate similarities in parallel using the MapReduce framework; they define the near-duplicate Q&A thread and use the evaluated signatures, parallel similarity calculation and a linear combination method to extract near-duplicates. The proposed method was evaluated on a real-world data collection crawled from a popular Q&A forum, and the experimental results show that it can effectively and efficiently detect near-duplicate content in large web collections; about 15.78% of the Q&A threads in the collection have at least one near-duplicate.
In the paper [3], titled "A Practical Approach for Relevance Measure of Inter-Sentence", the authors observe that many natural language processing tasks, such as text classification, text clustering, text summarization, and information retrieval, cannot skip the step of measuring inter-sentence relevance. However, many current NLP systems calculate not the inter-sentence relevance but the inter-sentence similarity. In fact, similarity is different from relevance: a similarity measure can be acquired by comparing the surface tokens of the sentences, but a relevance measure can be obtained only by comparing the interior meaning of the sentences. In this paper, the authors describe a method to explore the quantified conceptual relations of word pairs using the definition of a lexical item in a modern Chinese standard dictionary, and propose a practical approach to measure inter-sentence relevance. The results of the examples show that the approach can measure the relevance of two sentences that have little or no similarity but a certain relevance. The method is also compatible with the current cosine-similarity method.
In the paper [4], titled "Detection and Optimized Disposal of Near-Duplicate Pages", the authors note that the search engine is an important tool for users to access network information resources; however, a large number of duplicate and near-duplicate pages add to the user's burden. Currently, search engines only remove exact duplicate pages and do not yet have effective strategies for detecting and disposing of near-duplicate pages. This paper analyzes the existing algorithms to select an appropriate algorithm for detecting near-duplicate pages, and optimizes the disposal strategy to ensure that near-duplicate pages do not take up too much space in search results while still being used effectively. This allows users to retrieve the needed information more easily.
In the paper [6], titled "Text Based Similarity Metrics and Delta for Semantic Web Graphs", the authors note that recognizing that two Semantic Web documents or graphs are similar, and characterizing their differences, is useful in many tasks, including retrieval, updating, version control and knowledge base editing. They describe several text-based similarity metrics that characterize the relation between Semantic Web graphs and evaluate these metrics for three special cases of similarity: similarity in classes and properties, similarity disregarding differences in base-URIs, and the versioning relationship. They apply these techniques to a special use case, generating a delta between versions of a Semantic Web graph, and evaluate their system on several tasks using a collection of graphs from the archive of the Swoogle Semantic Web search engine.


In the paper [7], titled "Semantic Representation and Search Techniques for Document Retrieval Systems", the authors observe that organizing a repository of documents and resources for learning in a specialized field such as Information Technology, together with search techniques based on domain knowledge or document content, is an urgent need in the practice of teaching, learning and research. There have been several works related to methods of organization and search by content; however, the results are still limited and insufficient to meet users' demand for semantic document retrieval. This paper presents a solution for the organization of a repository that supports semantic representation and processing in search. The proposed solution is a model that integrates components such as an ontology describing domain knowledge, a database of the document repository, semantic representations for documents and a file system, together with semantic processing techniques and advanced search techniques based on measuring semantic similarity. The solution is applied to build a document retrieval system in the field of Information Technology, with a semantic search function serving students, teachers, and managers as well.
In the paper [8], titled "SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections", the authors, motivated by political scientists who need to manually analyze large Web archives of news sites, present SpotSigs, a new algorithm for extracting and matching signatures for near-duplicate detection in large Web crawls. The spot signatures are designed to favor the natural-language portions of Web pages over advertisements and navigational bars. The contributions of SpotSigs are twofold: 1) by combining stopword antecedents with short chains of adjacent content terms, they create robust document signatures with a natural ability to filter out noisy components of Web pages that would otherwise distract pure n-gram-based approaches such as Shingling; 2) they provide an exact and efficient, self-tuning matching algorithm that exploits a novel combination of collection partitioning and inverted index pruning for high-dimensional similarity search.


In the paper [9], titled "Adaptive near-duplicate detection via similarity learning", the authors present a novel near-duplicate document detection method that can easily be tuned for a particular domain. The method represents each document as a real-valued sparse k-gram vector, where the weights are learned to optimize for a specified similarity function, such as the cosine similarity or the Jaccard coefficient. Near-duplicate documents can be reliably detected through this improved similarity measure. In addition, these vectors can be mapped to a small number of hash values as document signatures through a locality-sensitive hashing scheme for efficient similarity computation.
In the paper [10], titled "Learning to extract keyphrases from text", the authors note that recent commercial software, such as Microsoft's Word 97 and Verity's Search 97, includes algorithms that automatically extract keyphrases from documents. In this paper, the authors approach the problem of automatically extracting keyphrases from text as a supervised learning task: a document is treated as a set of phrases, which the learning algorithm must learn to classify as positive or negative examples of keyphrases.
In the paper [11], titled "Keyword Extraction from Documents Using a Neural Network Model", the authors note that a document surrogate is usually represented as a list of words. Because not all words in a document reflect its content, it is necessary to select from the document the important words that relate to its content. Such important words are called keywords and are selected with a particular equation based on Term Frequency (TF) and Inverse Document Frequency (IDF). Additionally, the position of each word in the document and the inclusion of the word in the title should be considered when selecting keywords from the words contained in the text. An equation based on all of these factors becomes too complicated to be applied directly to the selection of keywords.
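Since several of the surveyed papers rely on TF-IDF weighting, the following sketch shows the standard computation tf-idf(t, d) = tf(t, d) * log(N / df(t)) on a tiny made-up corpus; it is an illustration, not code from any of the cited systems.

import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class TfIdf {
    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("java", "servlet", "jsp"),
                Arrays.asList("java", "swing"),
                Arrays.asList("semantic", "relation", "java"));
        // document frequency: in how many documents each term occurs
        Map<String, Integer> df = new HashMap<>();
        for (List<String> d : docs)
            for (String t : new HashSet<>(d))
                df.merge(t, 1, Integer::sum);
        int n = docs.size();
        List<String> doc = docs.get(0);
        for (String t : new HashSet<>(doc)) {
            double tf = Collections.frequency(doc, t) / (double) doc.size();
            double idf = Math.log((double) n / df.get(t));
            System.out.printf("%s: %.3f%n", t, tf * idf); // "java" scores 0
        }
    }
}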


Chapter 3

Software Requirements Specification

3.1 Software Requirements Specification
A Software Requirements Specification (SRS) is a complete description of the behavior of the system to be developed. It includes the functional and non-functional requirements for the software to be developed. The functional requirements describe what the software should do, and the non-functional requirements describe the constraints on the design or implementation of the system. Requirements must be measurable, testable, related to identified needs or opportunities, and defined to a level of detail sufficient for system design.

What the software has to do is directly perceived by its users, either human users or other software systems. The common understanding between the user and the developer is captured in the requirements document. Writing a software requirements specification reduces development effort, as careful review of the document can reveal omissions, misunderstandings, and inconsistencies early in the development cycle, when these problems are easier to correct. The SRS discusses the product but not the project that developed it; hence the SRS serves as a basis for later enhancement of the finished product. The SRS may need to be altered, but it does provide a foundation for continued product evaluation.

Resource Requirement
NetBeans IDE 6.9.1: NetBeans is a multi-language software development environment comprising an integrated development environment (IDE) and an extensible plug-in system. It is written primarily in Java and can be used to develop applications in Java and, by means of the various plug-ins, in other languages as well, including C, C++, COBOL, Python, Perl, PHP, and others. NetBeans employs plug-ins in order to provide all of its functionality on top of (and including) the runtime system, in contrast to some other applications where functionality is typically hard-coded. The IDE includes Java development tools offering a built-in incremental Java compiler and a full model of the Java source files. This allows for advanced refactoring techniques and code analysis. The IDE also makes use of a workspace, in this case a set of metadata over a flat file space, allowing external file modifications as long as the corresponding workspace "resource" is refreshed afterwards.
Java Development Kit: The Java Development Kit (JDK) is an Oracle Corporation product aimed at Java developers. Since the introduction of Java, it has been by far the most widely used Java SDK. On 17 November 2006, Sun announced that it would be released under the GNU General Public License (GPL), thus making it free software. This happened in large part on 8 May 2007, when Sun contributed the source code to OpenJDK.
The JDK has as its primary components a collection of programming tools, including:

java - the loader for Java applications. This tool is an interpreter and can interpret the class files generated by the javac compiler. A single launcher is now used for both development and deployment; the old deployment launcher, jre, no longer comes with the Sun JDK and has been replaced by this new java loader.
javac - the compiler, which converts source code into Java bytecode.
appletviewer - runs and debugs Java applets without a web browser.
apt - the annotation-processing tool.
extcheck - a utility that detects JAR-file conflicts.
idlj - the IDL-to-Java compiler. This utility generates Java bindings from a given Java IDL file.
javadoc - the documentation generator, which automatically generates documentation from source code comments.
jar - the archiver, which packages related class libraries into a single JAR file. This tool also helps manage JAR files.
javah - the C header and stub generator, used to write native methods.
javap - the class file disassembler.
javaws - the Java Web Start launcher for JNLP applications.
jconsole - the Java Monitoring and Management Console.
jdb - the debugger.
jhat - the Java Heap Analysis Tool (experimental).
jinfo - gets configuration information from a running Java process or crash dump (experimental).
jmap - outputs the memory map for Java and can print shared object memory maps or heap memory details of a given process or core dump (experimental).
jps - the Java Virtual Machine Process Status Tool, which lists the instrumented HotSpot Java Virtual Machines (JVMs) on the target system (experimental).
jrunscript - the Java command-line script shell.
jstack - prints Java stack traces of Java threads (experimental).
jstat - the Java Virtual Machine statistics monitoring tool (experimental).
jstatd - the jstat daemon (experimental).
policytool - the policy creation and management tool, which can determine policy for a Java runtime, specifying which permissions are available for code from various sources.
VisualVM - a visual tool integrating several command-line JDK tools and lightweight performance and memory profiling capabilities.
wsimport - generates portable JAX-WS artifacts for invoking a web service.
xjc - part of the Java API for XML Binding (JAXB). It accepts an XML schema and generates Java classes.

Experimental tools may not be available in future versions of the JDK.


The JDK also comes with a complete Java Runtime Environment, usually called a private runtime because it is separated from the "regular" JRE and has extra contents. It consists of a Java Virtual Machine and all of the class libraries present in the production environment, as well as additional libraries useful only to developers, such as the internationalization libraries and the IDL libraries. Copies of the JDK also include a wide selection of example programs demonstrating the use of almost all portions of the Java API.
Swing: The Java Foundation Classes (JFC) consist of five major parts: AWT, Swing, Accessibility, Java 2D, and Drag and Drop. Java 2D has become an integral part of AWT, Swing is built on top of AWT, and Accessibility support is built into Swing. The five parts of JFC are certainly not mutually exclusive, and Swing is expected to merge more deeply with AWT in future versions of Java. Swing is a set of classes that provides more powerful and flexible components than are possible with the AWT. In addition to the familiar components, Swing supplies tabbed panes, scroll panes, trees, and tables. It provides a single API capable of supporting multiple look-and-feels, so that developers and end users are not locked into a single platform's look-and-feel. The Swing library makes heavy use of the MVC software design pattern, which conceptually decouples the data being viewed from the user interface controls through which it is viewed.

Swing possesses several traits: 1) platform independence; 2) extensibility; 3) component orientation; 4) customizability; 5) configurability; 6) pluggable look and feel. Platform independence holds both in terms of Swing's expression and its implementation. Extensibility allows the "plugging" of various custom implementations of specified framework interfaces: users can provide their own custom implementations of these components to override the defaults. Component orientation means that each component responds to a well-known set of commands specific to it; specifically, Swing components are JavaBeans components, compliant with the JavaBeans Component Architecture specifications. Through the customizability feature, users can programmatically customize a standard Swing component by assigning specific borders, colors, backgrounds, opacities, and so on. Configurability allows Swing to respond at runtime to fundamental changes in its settings. Finally, the pluggable look and feel allows one to specialize the look and feel of widgets: by modifying the default via runtime parameters, by deriving from an existing one, by creating one from scratch, or, beginning with J2SE 5.0, by using the look and feel configured with an XML property file.
J2EE Platform
As you may already know, J2EE is a platform for executing server-side Java applications. Before J2EE was born, server-side Java applications were written using vendor-specific APIs. Each vendor had unique APIs and architectures. This resulted in a huge learning curve for Java developers and architects, who had to learn and program with each of these API sets, and in higher costs for the companies. The development community could not reuse the lessons learnt in the trenches. Consequently, the entire Java developer community was fragmented, isolated and stunted, making it very difficult to build serious enterprise applications in Java. Fortunately, the introduction of J2EE and its adoption by the vendors has resulted in the standardization of its APIs. This in turn reduced the learning curve for server-side Java developers. The J2EE specification defines a whole lot of interfaces and a few classes. Vendors (like BEA and IBM, for instance) have provided implementations of these interfaces adhering to the J2EE specification. These implementations are called J2EE Application Servers.

The J2EE application servers provide infrastructure services such as threading, pooling and transaction management out of the box. The application developers can thus concentrate on implementing business logic. Consider the J2EE stack from a developer's perspective. At the bottom of the stack is Java 2 Standard Edition (J2SE). J2EE application servers run in the Java Virtual Machine (JVM) sandbox. They expose the standard J2EE interfaces to the application developers. Two types of applications can be developed and deployed on J2EE application servers: web applications and EJB applications.
These applications are deployed and executed in containers. The J2EE specification defines containers for managing the lifecycle of server-side components. There are two types of containers: servlet containers and EJB containers. Servlet containers manage the lifecycle of web applications, and EJB containers manage the lifecycle of EJBs.

J2EE web application

Any web application that runs in the servlet container is called a J2EE web application. The servlet container implements the Servlet and JSP specifications. It provides various entry points for handling requests originating from a web browser. There are three entry points from the browser into a J2EE web application: Servlets, JSPs and Filters. You can create your own Servlets by extending the javax.servlet.http.HttpServlet class and implementing the doGet() and doPost() methods. You can create JSPs simply by creating a text file containing JSP markup tags. You can create Filters by implementing the javax.servlet.Filter interface. The servlet container becomes aware of Servlets and Filters when they are declared in a special file called web.xml; a J2EE web application has exactly one web.xml file.

A servlet is the most basic J2EE web component. It is managed by the servlet container. All servlets implement the Servlet interface directly or indirectly. In general terms, a servlet is the endpoint for requests adhering to a protocol. However, the Servlet specification mandates implementations only for servlets that handle HTTP requests, although it is possible to implement the servlet and the container to handle other protocols, such as FTP, too. When writing Servlets for handling HTTP requests, you generally subclass the HttpServlet class. HTTP defines several request methods, including GET, POST, PUT, HEAD and DELETE. Of these, GET and POST are the only forms of request submission relevant to most application developers. Hence your subclass of HttpServlet should implement the two methods doGet() and doPost() to handle GET and POST respectively, as sketched below.
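A minimal sketch of such a servlet follows. The class name, URL behavior and response text are assumptions for illustration; the servlet would be declared in web.xml in the usual way.

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class ArticleServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        resp.setContentType("text/html");
        resp.getWriter().println("<html><body>Submit an article via POST.</body></html>");
    }

    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        // "articleName" is a hypothetical form field
        String name = req.getParameter("articleName");
        resp.getWriter().println("Received article: " + name);
    }
}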

Presentation Tier Strategies


Technologies used for the presentation tier can be roughly classified into three
categories:
1. Markup based Rendering (e.g. JSPs)
2. Template based Transformation (e.g. Velocity, XSLT)
3. Rich content (e.g. Macromedia Flash, Flex, Laszlo)

Markup based Rendering

JSPs are perfect examples of markup-based presentation tiers. In markup-based presentation, a variety of tags are defined (just like HTML tags). The tag definitions may be purely for presentation, or they can contain business logic. They are mostly client-tier specific, e.g. JSP tags producing HTML content. A typical JSP is interpreted in the web container, generating HTML; this HTML is then rendered in the web browser.

In the last section, you saw how Servlets produced output HTML in addition to executing business logic. So why aren't Servlets used for the presentation tier? The answer lies in the separation of concerns essential in real-world J2EE projects. Back in the days when JSPs didn't exist, servlets were all that you had to build J2EE web applications. They handled requests from the browser, invoked middle-tier business logic and rendered responses in HTML to the browser. Now that's a problem. A Servlet is a Java class coded by Java programmers. It is fine for servlets to handle browser requests and contain business logic, since that is where they belong; but HTML formatting and rendering is the concern of the page author, who most likely does not know Java. So the question arises: how do we separate these two concerns intermingled in Servlets? JSPs are the answer to this dilemma. JSPs are servlets in disguise!


The philosophy behind JSP is that page authors know HTML. HTML is a markup language, so learning a few more markup tags will not cause a paradigm shift for the page authors; at least it is much easier than learning Java and OO! JSP provides some standard tags, and Java programmers can provide custom tags. Page authors can write server-side pages by mixing HTML markup and JSP tags. Such server-side pages are called JSPs. They are called server-side pages because it is the servlet container that interprets them to generate HTML. The generated HTML is sent to the client browser.

Server-side pages in other languages are parsed every time they are accessed, which is expensive. In J2EE, the expensive parsing is replaced by generating a Java class from the JSP. The first time a JSP is accessed, its contents are parsed and an equivalent Java class is generated; subsequent accesses are fast. Here is a twist to the story: the Java classes that are generated by parsing JSPs are nothing but Servlets! In other words, every JSP is parsed at runtime (or precompiled) to generate a Servlet class.

Presentation Logic and Business Logic: What's the difference?

The term Business Logic refers to the middle-tier logic, the core of the system, usually implemented in core Java. The code that controls JSP navigation, handles user inputs and invokes the appropriate business logic is referred to as Presentation Logic. The actual JSP, the front end presented to the user, contains HTML and custom tags to render the page and as little logic as possible. A rule of thumb is: the dumber the JSP gets, the easier it is to maintain. In reality, however, some of the presentation logic percolates into the actual JSP, making it tough to draw a line between the two.

Model 1 architecture is the easiest way of developing JSP-based web applications. It cannot get any easier. In Model 1, the browser directly accesses JSP pages; in other words, user requests are handled directly by the JSP. Consider an HTML page with a hyperlink to a JSP: when the user clicks on the hyperlink, the JSP is directly invoked. This is shown in the figure.

The servlet container parses the JSP and executes the resulting Java servlet. The JSP contains embedded code and tags to access the Model JavaBeans. The Model JavaBeans contain attributes for holding the HTTP request parameters from the query string. In addition, they contain logic to connect to the middle tier, or directly to the database using JDBC, to get the additional data needed to display the page. The JSP is then rendered as HTML using the data in the Model JavaBeans and other helper classes and tags.
Problems with Model 1 Architecture
Model 1 architecture is easy. There is some separation between content (Model JavaBeans) and presentation (JSP). This separation is good enough for smaller applications, but larger applications have a lot of presentation logic. In Model 1 architecture, the presentation logic usually leads to a significant amount of Java code embedded in the JSP in the form of scriptlets. This is ugly and a maintenance nightmare even for experienced Java developers. In large applications, JSPs are developed and maintained by page authors, and the intermingled scriptlets and markup result in an unclear definition of roles, which is very problematic.


Application control is decentralized in Model 1 architecture, since the next page to be displayed is determined by the logic embedded in the current page. Decentralized navigation control can cause headaches. All this leads us to the Model 2 architecture for designing JSP pages.

Model 2 Architecture (MVC)

The Model 2 architecture for designing JSP pages is, in reality, Model View Controller (MVC) applied to web applications; hence the two terms can be used interchangeably in the web world. MVC originated in Smalltalk and has since made its way into the Java community. Model 2 architecture and its derivatives are the cornerstones of all serious, industrial-strength web applications designed in the real world, so it is essential that you understand this paradigm thoroughly. The figure shows the Model 2 (MVC) architecture.

The main difference between Model 1 and Model 2 is that in Model 2, a controller handles the user request instead of another JSP. The controller is implemented as a Servlet. The following steps are executed when the user submits a request:
1. The Controller Servlet handles the user's request. (This means the hyperlink in the JSP should point to the controller servlet.)
2. The Controller Servlet then instantiates appropriate JavaBeans based on the request parameters (and optionally also based on session attributes).
3. The Controller Servlet, by itself or through a controller helper, communicates with the middle tier or directly with the database to fetch the required data.
4. The Controller sets the resultant JavaBeans (either the same or a new one) in one of the following contexts: request, session or application.
5. The Controller then dispatches the request to the next view based on the request URL.
6. The View uses the resultant JavaBeans from step 4 to display data.
The sole function of the JSP in Model 2 architecture is to display the data from the JavaBeans set in the request, session or application scopes.

Advantages of Model 2 Architecture

Since there is no presentation logic in the JSP, there are no scriptlets, which means fewer maintenance nightmares. (Note that although Model 2 is directed towards the elimination of scriptlets, it does not architecturally prevent you from adding them; this has led to widespread misuse of Model 2 architecture.)

With MVC you can have as many controller servlets in your web application as you like. In fact, you can have one controller servlet per module. However, there are several advantages to having a single controller servlet for the entire web application.


In a typical web application, there are several tasks that you want to perform for every incoming request. For instance, you have to check whether the user requesting an operation is authorized to do so. You also want to log the user's entry into and exit from the web application for every request. You might like to centralize the logic for dispatching requests to other views. The list goes on. If you have several controller servlets, chances are that you will have to duplicate the logic for all these tasks in all those places. A single controller servlet for the web application lets you centralize all the tasks in a single place, giving elegant code that is easier to maintain.

Web applications based on Model 2 architecture are easier to maintain and extend, since the views do not refer to each other and there is no presentation logic in the views. It also allows you to clearly define roles and responsibilities in large projects, thus allowing better coordination among team members.
Controller gone bad: the Fat Controller
If MVC is all that great, why do we need Struts at all? The answer lies in the difficulties associated with applying bare-bones MVC to real-world complexities. In medium to large applications, centralized control and processing logic in the servlet, the greatest plus of MVC, is also its weakness. Consider a modest application with 15 JSPs. Assume that each page has five hyperlinks (or five form submissions). The total number of user requests to be handled in the application is 75. Since we are using an MVC framework, a centralized controller servlet handles every user request. For each type of incoming request there is an if block in the doGet() method of the controller Servlet to process the request and dispatch to the next view. For this modest application of ours, the controller Servlet has 75 if blocks. Even if you assume that each if block delegates the request handling to helper classes, it is still no good. You can only imagine how bad it gets for a complex enterprise web application. So, we have a problem at hand. The Controller Servlet that started out as the greatest thing since sliced bread has gone bad. It has put on a lot of weight and become a Fat Controller.


MVC with configurable controller
When an application gets large, you cannot stick to bare-bones MVC. You have to extend it somehow to deal with these complexities. One mechanism of extending MVC that has found widespread adoption is based on a configurable controller servlet. The MVC with a configurable controller servlet is shown in the figure.

When the HTTP request arrives from the client, the Controller Servlet looks up a properties file to decide on the right handler class for the HTTP request. This handler class is referred to as the Request Handler. The Request Handler contains the presentation logic for that HTTP request, including business logic invocation; in other words, the Request Handler does everything that is needed to handle the HTTP request. The only difference so far from bare-bones MVC is that the controller servlet looks up a properties file to instantiate the handler instead of calling it directly.


At this point you might be wondering how the controller servlet knows which Handler to instantiate. The answer is simple. Two different HTTP requests cannot have the same URL; hence the URL uniquely identifies each HTTP request on the server side, and each URL needs a unique Handler. In simpler terms, there is a one-to-one mapping between URLs and Handler classes. This information is stored as key-value pairs in the properties file. The Controller Servlet loads the properties file on startup to find the appropriate Request Handler for each incoming URL request.

The controller servlet uses Java reflection to instantiate the Request Handler. However, there must be some sort of commonality between the Request Handlers for the servlet to instantiate them generically. The commonality is that all Request Handler classes implement a common interface; let us call it the Handler interface. In its simplest form, the Handler interface has one method, say execute(). The controller servlet reads the properties file to instantiate the Request Handler.
The Controller Servlet instantiates the Request Handler in the doGet() method and invokes the execute() method on it using Java reflection. The execute() method invokes the appropriate business logic from the middle tier and then selects the next view to be presented to the user. The controller servlet forwards the request to the selected JSP view. All this happens in the doGet() method of the controller servlet. The doGet() method's lifecycle never changes; what changes is the Request Handler's execute() method. A minimal sketch of such a configurable controller is given below. You may not have realized it, but you have just seen how Struts works in a nutshell! Struts is a controller-servlet-based configurable MVC framework that executes predefined methods in the handler objects. Instead of using a properties file, Struts uses XML to store more useful information.
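Below is a minimal sketch of such a configurable controller, assuming a properties file at /WEB-INF/handlers.properties with lines such as /submitArticle=app.SubmitArticleHandler and a common Handler interface; these names are illustrative and this is not Struts itself.

import java.io.IOException;
import java.util.Properties;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// the common interface implemented by every Request Handler
interface Handler {
    String execute(HttpServletRequest request) throws Exception;
}

public class ControllerServlet extends HttpServlet {
    private final Properties handlers = new Properties();

    @Override
    public void init() throws ServletException {
        try {
            // URL-to-handler-class mapping, loaded once on startup
            handlers.load(getServletContext()
                    .getResourceAsStream("/WEB-INF/handlers.properties"));
        } catch (IOException e) {
            throw new ServletException(e);
        }
    }

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        String className = handlers.getProperty(req.getPathInfo());
        try {
            // one-to-one URL-to-handler mapping, instantiated by reflection
            Handler handler = (Handler) Class.forName(className)
                    .getDeclaredConstructor().newInstance();
            String nextView = handler.execute(req); // returns the next view
            req.getRequestDispatcher(nextView).forward(req, resp);
        } catch (Exception e) {
            throw new ServletException(e);
        }
    }
}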
Struts

In Struts, there is only one controller servlet for the entire web application. This controller servlet is called ActionServlet and resides in the package org.apache.struts.action. It intercepts every client request and populates an ActionForm from the HTTP request parameters. An ActionForm is a normal JavaBeans class: it has attributes corresponding to the HTTP request parameters, and getter and setter methods for those attributes. You have to create your own ActionForm for every HTTP request handled through the Struts framework by extending the org.apache.struts.action.ActionForm class. For lack of better terminology, let us coin a term to describe classes such as ActionForm: View Data Transfer Object, an object that holds the data from the HTML page and transfers it around the web-tier framework and application classes.

The ActionServlet then instantiates a Handler. The Handler class name is obtained from an XML file based on the URL path information. This XML file is referred to as the Struts configuration file and is by default named struts-config.xml.


The Handler is called an Action in Struts terminology, and it is created by extending the Action class in the org.apache.struts.action package. The Action class defines a method called execute(). You override this method in your own Actions and invoke the business logic from it. The execute() method returns the name of the next view (JSP) to be shown to the user, and the ActionServlet forwards to the selected view. A sketch of an ActionForm and an Action is given below.

That was Struts in a nutshell. Struts is of course more than just this: it is a full-fledged presentation framework. Throughout the development of the application, both the page author and the developer need to coordinate and ensure that any changes to one area are appropriately handled in the other. Struts aids in the rapid development of web applications by separating these concerns. For instance, it has custom tags for JSPs: the page author can concentrate on developing the JSPs using custom tags specified by the framework, while the application developer works on creating the server-side representation of the data and its interaction with a back-end data repository. Further, it offers a consistent way of handling user input and processing it.
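The sketch below shows a Struts 1 ActionForm and Action for the article-submission step. The class names, the "articleName" property and the "success" forward are assumptions; the mapping from the URL to the Action, and from "success" to a JSP, lives in struts-config.xml.

import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.apache.struts.action.Action;
import org.apache.struts.action.ActionForm;
import org.apache.struts.action.ActionForward;
import org.apache.struts.action.ActionMapping;

public class SubmitArticleForm extends ActionForm {
    private String articleName; // populated automatically from the request
    public String getArticleName() { return articleName; }
    public void setArticleName(String articleName) { this.articleName = articleName; }
}

class SubmitArticleAction extends Action {
    @Override
    public ActionForward execute(ActionMapping mapping, ActionForm form,
                                 HttpServletRequest request,
                                 HttpServletResponse response) throws Exception {
        SubmitArticleForm f = (SubmitArticleForm) form;
        // ... invoke the business logic here to store the article ...
        request.setAttribute("articleName", f.getArticleName());
        return mapping.findForward("success"); // next view from struts-config.xml
    }
}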

3.2. Operating Environment

3.2.1. Hardware Requirements
The hardware and platform requirements of the project are summarized in the following table:

Sl No  Parameter                      Description
1      RAM                            500 MB - 1 GB
2      Hard Disk                      120 GB - 160 GB
3      Java Development Kit version   JDK 1.5
4      Database                       MySQL
5      Database Front End             HeidiSQL / Toad for MySQL
6      Tool for Java Development      Eclipse
7      Front End Technology           JSP
8      Framework                      Spring Framework
9      Server                         Tomcat 8.0


3.2.2. Software Requirements

The software requirements are summarized in the following table:

Sl No  Parameter Name                          Parameter Value
1      Development Language                    Java
2      Java Development Kit Version            JDK 1.6
3      Java Runtime Environment                JRE 6
4      Database (Backend)                      MySQL
5      Database Front End                      HeidiSQL
6      Database Front End for Excel export     Toad for MySQL
7      Development Tool                        Eclipse
8      Server Type                             Web Server
9      Web Server                              Tomcat 6.0
10     Framework Used                          Struts Framework
11     View Technology Used                    Java Server Pages
12     Designing                               Cascading Style Sheets

3.3. Functional Requirements

The following are the functional requirements of the project:
1. Article Submission: This module is responsible for the storage of articles.
2. Data Cleaning: This module is used to remove stop words from the article.
3. Tokenization: This process is used to obtain all the keywords of the article and assign each a unique ID as well as the web site ID.
4. Syntactic Relation: This module is responsible for finding the syntactic relations of articles, i.e., the verbs, adverbs and adjectives of the articles.
5. Semantic Relation: This module is used to find the various semantic relations, i.e., hypernymy.
6. Score Computation: This is used to measure the score with respect to the syntactic and semantic relations.

3.4. Non-Functional Requirements

Interface requirements
How will the new system interface with its environment?
- User interfaces and user-friendliness
- Interfaces with other systems

Performance requirements
- Time/space bounds: workloads, response time, throughput and available storage space, e.g. "the system must handle 1,000 transactions per second".
- Reliability: the availability of components and the integrity of information maintained and supplied to the system, e.g. "the system must have less than 1 hour of downtime per three months".
- Security: e.g. permissible information flows, or who can do what.
- Survivability: e.g. the system will need to survive fire, natural catastrophes, etc.

Operating requirements
- Physical constraints (size, weight)
- Personnel availability and skill level
- Accessibility for maintenance
- Environmental conditions

Summary
This chapter described the Software Requirements Specification, the operating environment (hardware and software requirements), the functional requirements and the non-functional requirements of the project.

Chapter 4

High Level Design

4.1. High Level Design
Design is one of the most important phases of software development. Design is a creative process in which a system organization is established that will satisfy the functional and non-functional system requirements. Large systems are always decomposed into sub-systems that provide related sets of services. The output of the design process is a description of the software architecture.

Data Flow Diagram Level 0

Level 0 is the initial level of the data flow diagram, generally called the context-level diagram. It is common practice for a designer to draw a context-level DFD first, showing the interaction between the system and outside entities. This context-level DFD is then exploded to show more detail of the system being modeled.

[Figure: DFD Level 0: Articles → Identify Syntax and Semantic Relation → Relation Matrix and Similarity]

Fig: DFD Level 0

[Figure: DFD Level 1: Article Submission → Data Cleaning → Tokenization → Syntactic Relation → Semantic Relation → Score and Similarity Obtained]

Fig: DFD Level 1

Data Flow Diagram Level 2

[Figure: DFD Level 2: Articles → Read the List of Articles → Data Cleaning (stop words) → Tokenization → Find all Syntax Relations (verb, adverb and adjective) → Find all Semantic Relations → Score and Similarity Obtained]

Fig: DFD Level 2

4.2 Activity Diagram

[Figure: Activity Diagram: HTML/JSP view → web.xml → Controller (-servlet.xml) → Model → Database]

Fig: Activity Diagram

The above figure describes the system architecture that is followed in industry for the development of such software.

The figure shows that the user interface is designed as HTML/JSP pages. The request goes to the web container, which resolves the request against the web.xml file by first matching the URL pattern, then finding the corresponding servlet name in the servlet tag, and then looking up the servlet class. An object of ActionServlet is created, and the ActionServlet delegates its job to the Request Processor. The Request Processor looks up the action to be called in struts-config.xml; the corresponding action form is populated and then the action is called. The action class then calls the delegate, the delegate calls the service, and the service calls the data access layer. The results travel back exactly the opposite way, and the resultant JSP page is loaded.

Model
This is a Plain Old Java Object (POJO) with getters and setters. The setters are called automatically, making the data the user has entered available to the application.

Controller
This is the class that fetches the data entered by the user, processes it, calls the delegate layer and obtains the results.

Delegate


The delegate is the layer which contains nothing but calls to the appropriate service.

Service
This layer is responsible for the entire algorithmic implementation; it contains the heavyweight implementation of all the algorithms. Further, the service may require the help of the data access layer and many other helper classes for some operations.

Data Access Layer

This layer deals only with the CRUD operations, namely Create, Retrieve, Update and Delete. It has no other usage. This layer is used to fetch the data from the database tables.

Database
This is the place where all the tables are kept. A sketch of this layering is given below.
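The following sketch shows this layering with illustrative names: the controller talks to a delegate, the delegate to a service, and the service to a DAO that performs the CRUD work; an in-memory DAO stands in for the real database access.

import java.util.ArrayList;
import java.util.List;

interface ArticleDao {                        // Data Access Layer: CRUD only
    void create(String article);
    List<String> retrieveAll();
}

class ArticleService {                        // Service: the algorithmic work
    private final ArticleDao dao;
    ArticleService(ArticleDao dao) { this.dao = dao; }
    int articleCount() { return dao.retrieveAll().size(); }
}

class ArticleDelegate {                       // Delegate: thin call-through
    private final ArticleService service;
    ArticleDelegate(ArticleService service) { this.service = service; }
    int articleCount() { return service.articleCount(); }
}

class ArticleController {                     // Controller: user input in,
    private final ArticleDelegate delegate;   // results out via the delegate
    ArticleController(ArticleDelegate delegate) { this.delegate = delegate; }
    String handle() { return "Articles stored: " + delegate.articleCount(); }
}

public class LayeredDemo {
    public static void main(String[] args) {
        ArticleDao dao = new ArticleDao() {   // in-memory stand-in for the DB
            private final List<String> rows = new ArrayList<>();
            public void create(String a) { rows.add(a); }
            public List<String> retrieveAll() { return rows; }
        };
        dao.create("first article");
        ArticleController controller =
                new ArticleController(new ArticleDelegate(new ArticleService(dao)));
        System.out.println(controller.handle()); // Articles stored: 1
    }
}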

4.3. Use Case Diagram

The use case diagram is shown in the following figure.

[Figure: Use Case Diagram: the User interacts with the use cases Article Submission, Data Cleaning (with Stop Words), Tokenizing, Syntactic Relation, Semantic Relation and Score Computation]

Fig: Use Case Diagram


Chapter 5

Detailed Design
5.1. Purpose
Detailed design is the phase in which the internal logic of each of the modules specified in the high-level design is determined. In this phase, the details and algorithmic design of each of the modules are specified, and other low-level components and subcomponents are described. Each subsection of this chapter refers to or contains a detailed description of a system software component. This chapter also discusses the control flow in the software, with more detail about the software modules and each of their functionalities.
This chapter presents the following:
- Life cycle of the generic flow
- Flowchart for each module

5.1.1 Life Cycle of the Generic Flow

This section deals with the life cycle of a generic request flow through the system, together with the state diagram and the possible transitions between the states.

[Figure: life cycle of the process: View (HTML/JSP) → ActionForm → Action]

Fig: Life Cycle of the Process


The following are the stages for any user action:

1. View: this is where the user enters data and performs some action.
2. ActionForm: this is the POJO which contains the variables defined in the view, together with their setter and getter methods. The data is bound automatically.
3. Action: this is a class containing the execute() method, which is responsible for handling the logic of the project and for delegating the result to an appropriate view. It also makes use of helper methods to perform the business logic.

5.2. User Interface Design

5.2.1. Hardware Interfaces
There are no specific hardware interfaces used in the system.

5.2.2. Software Graphical User Interfaces

GUI (pronounced GOO-ee), short for graphical user interface, is a program interface that takes advantage of the computer's graphics capabilities to make the program easier to use. Well-designed graphical user interfaces can free the user from learning complex command languages. On the other hand, many users find that they work more effectively with a command-driven interface, especially if they already know the command language.

Graphical user interfaces, such as Microsoft Windows and the one used by the Apple Macintosh, feature the following basic components:
pointer: A symbol that appears on the display screen and that you move to select objects and commands. Usually, the pointer appears as a small angled arrow. Text-processing applications, however, use an I-beam pointer that is shaped like a capital I.
pointing device: A device, such as a mouse or trackball, that enables you to select objects on the display screen.
icons: Small pictures that represent commands, files, or windows. By moving the pointer to the icon and pressing a mouse button, you can execute a command or convert the icon into a window. You can also move the icons around the display screen as if they were real objects on your desk.
desktop: The area on the display screen where icons are grouped is often referred to as the desktop because the icons are intended to represent real objects on a real desktop.
windows: You can divide the screen into different areas. In each window, you can run a different program or display a different file. You can move windows around the display screen, and change their shape and size at will.
menus: Most graphical user interfaces let you execute commands by selecting a choice from a menu.

The graphical user interface is developed using HTML and the JSP language.

JavaServer Pages Technology
JavaServer Pages (JSP) technology allows you to easily create web content that has both static and dynamic components. JSP technology makes available all the dynamic capabilities of Java Servlet technology but provides a more natural approach to creating static content. The main features of JSP technology are as follows:
1. A language for developing JSP pages, which are text-based documents that describe how to process a request and construct a response
2. An expression language for accessing server-side objects
3. Mechanisms for defining extensions to the JSP language

A JSP page is a text document that contains two types of text: static data, which can be expressed in any text-based format (such as HTML, SVG, WML, and XML), and JSP elements, which construct dynamic content.
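
For instance, a minimal JSP page (a hypothetical file such as submitArticle.jsp; the field names are illustrative) mixes static HTML with a dynamic expression-language element:

<%-- Static HTML plus one dynamic EL expression --%>
<html>
  <body>
    <h2>Article Submission</h2>
    <p>Welcome, ${param.userName}</p>  <%-- dynamic: echoes a request parameter --%>
    <form action="submitArticle" method="post">
      <input type="text" name="articleName" />
      <textarea name="articleDescription"></textarea>
      <input type="submit" value="Submit" />
    </form>
  </body>
</html>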

The detailed design of each module can be described using the following flowcharts.

Article Module
The article module is responsible for the storage of articles. The article name and article description act as the inputs. The flow is summarized in the figure below, followed by a short sketch.

Fig: Article Module Flowchart. Start -> read the article name and article description -> retrieve the list of article names already in the application -> if the article name is already present in the list, raise a validation error; otherwise the storage of the article is successful.
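
A minimal Java sketch of this validation flow is given below; the in-memory list stands in for the real repository, and all names are illustrative assumptions.

import java.util.*;

public class ArticleModule {
    // Hypothetical stand-in for the repository of stored article names.
    private final List<String> articleNames = new ArrayList<>();

    public String submit(String articleName, String articleDescription) {
        // Check whether the article name is already in the list.
        if (articleNames.contains(articleName)) {
            return "Validation error: article name already exists";
        }
        // In the real system the description is stored alongside the name.
        articleNames.add(articleName);
        return "Storage of article is successful";
    }
}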

4.3 Data Cleaning


This is the process in which the data is cleaned of unwanted symbols and of a set of stop words. The data first undergoes a delimiting process and then a cleaning process. The set of stop words used, as is typical in data mining, is given in the snippet below.

List of Stop Words

a, about, above, across, after, afterwards, again, against, all, almost, alone, along, already, also, although,
always, am, among, amongst, amoungst, amount, an, and, another, any, anyhow, anyone, anything, anyway, anywhere, are, around, as, at, back,
be, became, because, become, becomes, becoming, been, before, beforehand, behind, being, below, beside, besides, between, beyond, bill, both, bottom, but,
by, call, can, cannot, cant, co, computer, con, could, couldnt, cry, de, describe, detail, do, done, down, due, during, each,
eg, eight, either, eleven, else, elsewhere, empty, enough, etc, even, ever, every, everyone, everything, everywhere, except, few, fifteen, fifty, fill,
find, fire, first, five, for, former, formerly, forty, found, four, from, front, full, further, get, give, go, had, has, hasnt,
have, he, hence, her, here, hereafter, hereby, herein, hereupon, hers, herself, him, himself, his, how, however, hundred, i, ie, if,
in, inc, indeed, interest, into, is, it, its, itself, keep, last, latter, latterly, least, less, ltd, made, many, may, me,
meanwhile, might, mill, mine, more, moreover, most, mostly, move, much, must, my, myself, name, namely, neither, never, nevertheless, next, nine,
no, nobody, none, noone, nor, not, nothing, now, nowhere, of, off, often, on, once, one, only, onto, or, other, others,
otherwise, our, ours, ourselves, out, over, own, part, per, perhaps, please, put, rather, re, same, see, seem, seemed, seeming, seems,
serious, several, she, should, show, side, since, sincere, six, sixty, so, some, somehow, someone, something, sometime, sometimes, somewhere, still, such,
system, take, ten, than, that, the, their, them, themselves, then, thence, there, thereafter, thereby, therefore, therein, thereupon, these, they, thick,
thin, third, this, those, though, three, through, throughout, thru, thus, to, together, too, top, toward, towards, twelve, twenty, two, un,
under, until, up, upon, us, very, via, was, we, well, were, what, whatever, when, whence, whenever, where, whereafter, whereas, whereby,
wherein, whereupon, wherever, whether, which, while, whither, who, whoever, whole, whom, whose, why, will, with, within, without, would, yet, you,
your, yours, yourself, yourselves

Module 2 - Data Cleaning

Fig: Data Cleaning Block Diagram. Unclean data is passed through delimiter-based extraction and stop-word-based cleaning, and the resulting clean data is stored in the repository; the stop words themselves are read from the stop word repository.

The flowchart for the data cleaning process is given below, followed by a short illustrative sketch of the step.
Fig: Data Cleaning Process Flowchart. Start -> read the website URL -> extract the individual words with the help of a delimiter such as a comma or a space -> clean the symbols and, if a word belongs to the stop word list, remove it -> store the clean data in the repository -> Stop.
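
A minimal Java sketch of this step follows; the delimiter pattern and the names are assumptions for illustration.

import java.util.*;

public class DataCleaner {
    private final Set<String> stopWords;

    public DataCleaner(Set<String> stopWords) {
        this.stopWords = stopWords;
    }

    // Split the raw text on commas and whitespace, strip symbols, and drop
    // any word found in the stop word list, as described in the flowchart.
    public List<String> clean(String rawText) {
        List<String> cleanWords = new ArrayList<>();
        for (String word : rawText.split("[,\\s]+")) {
            String w = word.replaceAll("[^a-zA-Z]", "").toLowerCase();
            if (!w.isEmpty() && !stopWords.contains(w)) {
                cleanWords.add(w);
            }
        }
        return cleanWords;
    }
}

For example, new DataCleaner(Set.of("the", "is")).clean("The data, is clean!") returns [data, clean].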

Module 3 - Tokenization
After the data is cleaned, the token extraction process begins, in which all the words in the articles, referred to as tokens, are extracted. The token extraction is again done with the help of delimiters. The block diagram and flowchart for the token extraction are given below.


Fig: Keyword Extraction Process. Clean data is passed through delimiter-based extraction, and the extracted keywords are stored in the keyword repository.


The flowchart for the token extraction process is given below.

Fig: Token Extraction Process Flowchart. Start -> take the clean data -> extract the individual words with the help of a delimiter such as a comma or a space -> while i <= the number of tokens, store token i in the repository and set i = i + 1 -> Stop.

The figure shows the token extraction process, where the clean data is scanned to obtain keywords known as tokens, and these tokens are then stored in the repository. A short sketch of this step follows.
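
A minimal Java sketch of the token extraction loop is given below; the names are illustrative assumptions.

import java.util.*;

public class Tokenizer {
    // Scan the clean data and store each token in the repository list.
    public List<String> extractTokens(String cleanData) {
        List<String> repository = new ArrayList<>();
        String[] tokens = cleanData.split("[,\\s]+"); // delimiter: comma or space
        for (int i = 0; i < tokens.length; i++) {     // loop over the tokens
            repository.add(tokens[i]);                // store token in repository
        }
        return repository;
    }
}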
