Chapter 1
INTRODUCTION
The Web has undergone exponential growth since its birth, and this expansion has
generated a number of problems; in this paper we address two of these: 1. The proliferation of
documents that are identical or almost identical. 2. The instability of URLs. The basis of our
approach is a mechanism for discovering when two documents are "roughly the same"; that is,
for discovering when they have the same content except for modifications such as formatting,
minor corrections, webmaster signature, or logo. Similarly, we can discover when a document is
"roughly contained" in another. Applying this mechanism to the entire collection of documents
found by the AltaVista spider yields a grouping of the documents into clusters of closely related
items. As explained below, this clustering can help solve the problems of document duplication
and URL instability. The duplication problem arises in two ways. First, there are documents that
are found in multiple places in identical form: examples include FAQ (Frequently Asked
Questions) and RFC (Request For Comments) documents, the online documentation for popular
programs, documents stored in several mirror sites, and legal documents. Second, there are
documents that are found in almost identical incarnations because they are:
1) Different versions of the same document.
2) The same document with different formatting.
3) The same document with site specific links, customizations or contact information.
4) Combined with other source material to form a larger document.
The instability problem arises when a particular URL becomes undesirable because the
associated document is temporarily unavailable or has moved, the URL refers to an old version
and the user wants the current version, or the URL is slow to access and the user wants an
identical or similar document that will be faster to retrieve. In all these cases, the ability to find
documents that are syntactically similar to a given document allows the user to find other,
acceptable versions of the desired item. URNs (Uniform Resource Names) have often been
Identical documents do not need to be handled specially in our algorithm, but they add to
the computational workload and can be eliminated quite easily. Identical documents obviously
share the same set of shingles and so, for the clustering algorithm, we only need to keep one
representative from each group of identical documents. Therefore, for each document we
generate a fingerprint that covers its entire contents. When we find documents with identical
fingerprints, we eliminate all but one from the clustering algorithm. After the clustering has been
completed, the other identical documents are added into the cluster containing the one kept
version. We can expand the collection of identical documents with the "lexically-equivalent"
documents and the "shingle-equivalent" documents. The lexically-equivalent documents are
identical after they have been converted to canonical form. The shingle-equivalent documents are
documents that have identical shingle values after the set of shingles has been selected.
Obviously, all identical documents are lexically-equivalent, and all lexically-equivalent
documents are shingle-equivalent. We can find each set of documents with a single fingerprint:
identical documents are found with the fingerprint of the entire original contents,
lexically-equivalent documents with the fingerprint of the entire canonicalized contents, and
shingle-equivalent documents with the fingerprint of the set of selected shingles.
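The three equivalence classes above can be sketched in Java as three fingerprints per document: one over the raw contents, one over a canonicalized form, and one over the selected shingle set. The canonicalization (lower-casing, whitespace collapsing) and the MD5 hash here are illustrative stand-ins; the original work uses its own canonical form and Rabin fingerprints.

```java
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch: three fingerprints per document, as described above.
public class Fingerprints {
    // Fingerprint of the raw contents: equal fingerprints mean identical documents.
    static String raw(String doc) { return md5(doc); }

    // Lexical fingerprint: canonicalize first (here: lower-case, collapse whitespace).
    static String lexical(String doc) {
        return md5(doc.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim());
    }

    // Shingle fingerprint: fingerprint of the (sorted) set of w-word shingles.
    static String shingle(String doc, int w) {
        String[] words = doc.toLowerCase(Locale.ROOT).trim().split("\\s+");
        Set<String> shingles = new TreeSet<>();
        for (int i = 0; i + w <= words.length; i++) {
            shingles.add(String.join(" ", Arrays.copyOfRange(words, i, i + w)));
        }
        return md5(shingles.toString()); // TreeSet gives a deterministic order
    }

    static String md5(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes("UTF-8"));
            StringBuilder sb = new StringBuilder();
            for (byte b : d) sb.append(String.format("%02x", b & 0xff));
            return sb.toString();
        } catch (Exception e) { throw new RuntimeException(e); }
    }
}
```

Note how the classes nest: documents that differ only in case and spacing share a lexical fingerprint but not a raw one, and documents with the same shingle set share a shingle fingerprint even when their lexical forms differ.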
Methodology
Fig: System modules — Article Submission, Data Cleaning, Token Determination, Syntactic
Relation, Semantic Relation, Score Computation, and Similarity Measure
2. Data Cleaning - This module is used to remove stop words from the article.
3. Tokenization - This process is used to obtain all the keywords of the article and assign
them a unique ID as well as the web site ID.
4. Syntactic Relation - This module is responsible for finding the syntactic relations of
articles, i.e. the verbs, adverbs, and adjectives of the articles.
5. Semantic Relation - This module is used to find the various semantic relations, e.g.
hypernymy.
6. Score Computation - This module measures the score with respect to the syntactic and
semantic relations.
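The Score Computation step above can be sketched as follows. The text does not give the combination formula, so a simple weighted sum of the syntactic and semantic scores is assumed here; the weights alpha and beta are illustrative, not from the report.

```java
// Hypothetical sketch of the Score Computation module: a weighted combination
// of the syntactic-relation score and the semantic-relation score.
public class ScoreComputation {
    // Both input scores are assumed normalized to [0, 1].
    static double combine(double syntacticScore, double semanticScore,
                          double alpha, double beta) {
        return alpha * syntacticScore + beta * semanticScore;
    }
}
```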
Problem Definition
Semantic and syntactic relations have played an important role in applications in recent years,
especially on the Semantic Web and in Information Retrieval, Information Extraction, and Question
Answering. Semantic and syntactic relations carry the main ideas of sentences or paragraphs.
This project presents our proposed algorithms for identifying semantic and syntactic relations
between objects and their properties in order to enrich a domain-specific ontology, namely the
Computing Domain Ontology, which is used in an Information Extraction system.
Previous Approach
Proposed Approach
The proposed approach is to automatically identify the syntactic and semantic relations
that might be found in the text of articles from a specific domain. Afterward, we extract these
relations in order to enrich a domain-specific ontology. This ontology can be used in many
applications focusing on the computing domain, such as Information Retrieval, Information
Extraction, and Question Answering. For this purpose, we propose a methodology that combines
Natural Language Processing (NLP) and Machine Learning.
Methodology
Fig: Methodology — Article Submission, View Articles, Data Cleaning (stop word analysis),
Tokenization, Identify Syntax Relations, and synonyms, hyponyms, hypernyms of instance data
Article Submission
The Article Submission module is used for submitting an article with its name and description.
View Articles
This module is responsible for viewing the articles.
Data Cleaning
This module is responsible for preprocessing and cleaning the text data. It makes use of a
stop word list in order to perform the analysis and do the cleaning: Data Cleaning removes the
stop words from each of the articles. After the data cleaning process is completed, the clean
data can be represented as a set.
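A minimal sketch of this cleaning step in Java, splitting on whitespace and punctuation and representing the clean data as a set; the stop word list here is a small illustrative sample, not the one used by the project.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Sketch of the Data Cleaning module: split the article text on delimiters,
// drop stop words, and represent the clean data as a set.
public class DataCleaning {
    // Illustrative sample only; a real stop word list is much larger.
    static final Set<String> STOP_WORDS =
        new LinkedHashSet<>(Arrays.asList("a", "an", "the", "is", "of", "and", "to", "in"));

    static Set<String> clean(String article) {
        Set<String> result = new LinkedHashSet<>();
        for (String token : article.toLowerCase(Locale.ROOT).split("[\\s,.]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) result.add(token);
        }
        return result;
    }
}
```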
Stopwords
Syntax Analyzer
Syntactic relations are the relations between concepts or words in a sentence with
respect to verbs and adverbs.
Chapter 2
LITERATURE SURVEY
In the paper [1], titled "Syntactic clustering of the Web", the authors develop an
efficient way to determine the syntactic similarity of files and apply it to every document
on the World Wide Web. Using this mechanism, they build a clustering of all the documents that
are syntactically similar. Possible applications include a "Lost and Found" service, filtering the
results of Web searches, updating widely distributed web pages, and identifying violations of
intellectual property rights.
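The core idea of [1] — w-shingles and their Jaccard resemblance — can be sketched as follows. This is a simplified illustration: the paper itself samples shingles via fingerprints rather than comparing full sets.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Sketch of shingling: a document's w-shingles are all contiguous runs of w
// tokens; two documents are syntactically similar when their shingle sets
// overlap heavily (resemblance = Jaccard coefficient).
public class Shingling {
    static Set<String> shingles(String doc, int w) {
        String[] t = doc.toLowerCase(Locale.ROOT).trim().split("\\s+");
        Set<String> s = new HashSet<>();
        for (int i = 0; i + w <= t.length; i++)
            s.add(String.join(" ", Arrays.copyOfRange(t, i, i + w)));
        return s;
    }

    // |A ∩ B| / |A ∪ B|
    static double resemblance(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a); inter.retainAll(b);
        Set<String> union = new HashSet<>(a); union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }
}
```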
In the paper [2], titled "Efficient near-duplicate detection for Q&A forum", the authors
address the issue of redundant data in large-scale collections of Q&A forums. They
propose and evaluate a novel algorithm for automatically detecting near-duplicate Q&A
threads. The main idea is to use a distributed index and the MapReduce framework to calculate
pairwise similarity and identify redundant data in a fast and scalable way. The authors define
the near-duplicate Q&A thread and use signatures, parallel similarity calculation, and a linear
combination method to extract near-duplicates, together with two distributed inverted index
methods to calculate similarities in parallel using the MapReduce framework. The proposed
method was evaluated on a real-world data collection crawled from a popular Q&A forum.
Experimental results show that it can effectively and efficiently detect near-duplicate content
in large web collections; about 15.78% of the Q&A threads in the collection have more than one
near-duplicate.
In the paper [3], titled "A Practical Approach for Relevance Measure of Inter-Sentence", the
authors note that many natural language processing tasks, such as text classification, text
clustering, text summarization, and information retrieval, cannot skip the step of measuring
inter-sentence relevance. However, many current NLP systems calculate not the inter-sentence
relevance but the inter-sentence similarity, and similarity is different from relevance.
In the paper [7], titled "Semantic Representation and Search Techniques for Document
Retrieval Systems", the authors describe that organizing a repository of documents
and resources for learning in a special field such as Information Technology, together with search
techniques based on domain knowledge or document content, is an urgent need in the practice of
teaching, learning, and researching. There have been several works related to methods of
organization and search by content. However, the results are still limited and insufficient to meet
users' demand for semantic document retrieval. The paper presents a solution for the
organization of a repository that supports semantic representation and processing in search. The
proposed solution is a model that integrates components such as an ontology describing domain
knowledge, a database of the document repository, semantic representations for documents, and
a file system, together with semantic processing techniques and advanced search techniques
based on measuring semantic similarity. The solution is applied to build a document retrieval
system in the field of Information Technology, with a semantic search function serving students,
teachers, and managers as well.
In the paper [8], titled "SpotSigs: Robust and Efficient Near Duplicate Detection in Large
Web Collections", the authors, motivated by political scientists who need to manually analyze
large Web archives of news sites, present SpotSigs, a new algorithm for extracting and
matching signatures for near-duplicate detection in large Web crawls. The spot signatures are
designed to favor natural-language portions of Web pages over advertisements and navigational
bars. The contributions of SpotSigs are twofold: 1) by combining stopword antecedents with
short chains of adjacent content terms, the authors create robust document signatures with a
natural ability to filter out noisy components of Web pages that would otherwise distract pure
n-gram-based approaches such as Shingling; 2) they provide an exact, efficient, and self-tuning
matching algorithm that exploits a novel combination of collection partitioning and inverted
index pruning for high-dimensional similarity search.
In the paper [9], titled "Adaptive near-duplicate detection via similarity learning", the
authors present a novel near-duplicate document detection method that can easily
be tuned for a particular domain. The method represents each document as a real-valued
sparse k-gram vector, where the weights are learned to optimize for a specified similarity
function, such as the cosine similarity or the Jaccard coefficient. Near-duplicate documents can
be reliably detected through this improved similarity measure. In addition, these vectors can be
mapped to a small number of hash values as document signatures through a locality-sensitive
hashing scheme for efficient similarity computation.
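A sketch of the vector representation used in [9], with raw character k-gram counts standing in for the learned weights, compared with cosine similarity over sparse maps.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: documents as sparse k-gram weight vectors (map from k-gram to
// weight), compared with cosine similarity. Raw counts replace the learned
// weights of the paper.
public class SparseCosine {
    static Map<String, Double> kgramVector(String doc, int k) {
        Map<String, Double> v = new HashMap<>();
        for (int i = 0; i + k <= doc.length(); i++)
            v.merge(doc.substring(i, i + k), 1.0, Double::sum);
        return v;
    }

    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            na += e.getValue() * e.getValue();
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w;  // only shared k-grams contribute
        }
        for (double w : b.values()) nb += w * w;
        return (na == 0 || nb == 0) ? 0.0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}
```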
In the paper [10], titled "Learning to extract keyphrases from text", the authors note
that recent commercial software, such as Microsoft's Word 97 and Verity's Search 97, includes
algorithms that automatically extract keyphrases from documents. The authors approach the
problem of automatically extracting keyphrases from text as a supervised learning task,
treating a document as a set of phrases which the learning algorithm must learn to classify as
positive or negative examples of keyphrases.
In the paper [11], titled "Keyword Extraction from Documents Using a Neural Network Model",
the authors describe that a document surrogate is usually represented as a list of words. Because
not all words in a document reflect its content, it is necessary to select important words from the
document that relate to its content. Such important words are called keywords and are selected
with a particular equation based on Term Frequency (TF) and Inverse Document Frequency
(IDF). Additionally, the position of each word in the document and the inclusion of the word in
the title should be considered when selecting keywords among the words contained in the text.
An equation based on all these factors gets too complicated to be applied to the selection of
keywords.
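The TF-IDF weighting mentioned above can be sketched as follows. This is one common formulation; the exact equation in the paper may differ.

```java
import java.util.List;

// Sketch of TF-IDF keyword weighting: tf counts the word within one document,
// idf penalizes words that occur in many documents of the corpus.
public class TfIdf {
    static double tf(String word, List<String> doc) {
        int c = 0;
        for (String w : doc) if (w.equals(word)) c++;
        return (double) c / doc.size();
    }

    // log(N / (1 + df)), smoothed so an unseen word does not divide by zero
    static double idf(String word, List<List<String>> corpus) {
        int df = 0;
        for (List<String> doc : corpus) if (doc.contains(word)) df++;
        return Math.log((double) corpus.size() / (1 + df));
    }

    static double tfIdf(String word, List<String> doc, List<List<String>> corpus) {
        return tf(word, doc) * idf(word, corpus);
    }
}
```

With this smoothing, a word appearing in every document gets a negative idf, while a rare word gets a positive one — rare words are favored as keywords.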
Chapter 3
Resource Requirement
NetBeans IDE 6.9.1: NetBeans is a multi-language software development environment comprising
an integrated development environment (IDE) and an extensible plug-in system. It is written
primarily in Java and can be used to develop applications in Java and, by means of the various
plug-ins, in other languages as well, including C, C++, COBOL, Python, Perl, PHP, and others.
java - the loader for Java applications. This tool is an interpreter and can interpret the
class files generated by the javac compiler. Now a single launcher is used for both
development and deployment. The old deployment launcher, jre, no longer comes with the
Sun JDK; it has been replaced by this new java loader.
javac - the compiler, which converts source code into Java bytecode.
appletviewer - this tool can be used to run and debug Java applets without a web browser.
idlj - the IDL-to-Java compiler. This utility generates Java bindings from a given
IDL file.
jar - the archiver, which packages related class libraries into a single JAR file. This tool
also helps manage JAR files.
javah - the C header and stub generator, used to write native methods.
jinfo - this utility gets configuration information from a running Java process or crash
dump (experimental).
jmap - this utility outputs the memory map for Java and can print shared-object memory
maps or heap memory details of a given process or core dump (experimental).
jps - the Java Virtual Machine Process Status Tool, which lists the instrumented HotSpot
Java Virtual Machines (JVMs) on the target system (experimental).
jstack - a utility which prints the Java stack traces of Java threads (experimental).
policytool - the policy creation and management tool, which can determine the policy for a
Java runtime, specifying which permissions are available for code from various sources.
VisualVM - a visual tool integrating several command-line JDK tools and lightweight
performance and memory profiling capabilities.
xjc - part of the Java API for XML Binding (JAXB). It accepts an XML schema and
generates Java classes.
components, compliant with the JavaBeans Component Architecture specifications. Through customizable
A servlet is the most basic J2EE web component. It is managed by the servlet container. All
servlets implement the Servlet interface directly or indirectly. In general terms, a servlet is the
endpoint for requests adhering to a protocol. However, the Servlet specification mandates an
implementation only for servlets that handle HTTP requests, although it is possible to
implement the servlet and the container to handle other protocols such as FTP too.
When writing servlets for handling HTTP requests, you generally subclass the HttpServlet class.
Among the HTTP request methods GET, POST, PUT, HEAD, and DELETE, only GET and POST
are relevant to most application developers. Hence your subclass of HttpServlet should
implement the two methods doGet() and doPost() to handle GET and POST respectively.
In the last section, you saw how servlets produced output HTML in addition to executing
business logic. So why aren't servlets used for the presentation tier? The answer lies in the
separation of concerns essential in real-world J2EE projects. Back in the days when JSPs didn't
exist, servlets were all that you had to build J2EE web applications. They handled requests from
the browser, invoked middle-tier business logic, and rendered responses in HTML to the browser.
Now that's a problem. A servlet is a Java class coded by Java programmers. It is okay to handle
browser requests and have business logic in the servlets, since that is where it belongs. HTML
formatting and rendering, however, is the concern of the page author, who most likely does not
know Java. So the question arises: how to separate these two concerns intermingled in servlets?
JSPs are the answer to this dilemma. JSPs are servlets in disguise!
The servlet container parses the JSP and executes the resulting Java servlet. The JSP
contains embedded code and tags to access the Model JavaBeans. The Model JavaBeans contains
attributes for holding the HTTP request parameters from the query string. In addition it contains
logic to connect to the middle tier or directly to the database using JDBC to get the additional
data needed to display the page. The JSP is then rendered as HTML using the data in the Model
JavaBeans and other Helper classes and tags.
Problems with Model 1 Architecture
Model 1 architecture is easy. There is some separation between content (Model JavaBeans) and
presentation (JSP). This separation is good enough for smaller applications. Larger applications,
however, have a lot of presentation logic. In Model 1 architecture, the presentation logic usually
leads to a significant amount of Java code embedded in the JSP in the form of scriptlets. This is
ugly and a maintenance nightmare even for experienced Java developers. In large applications,
JSPs are developed and maintained by page authors. The intermingled scriptlets and markup
result in an unclear definition of roles and are very problematic.
The main difference between Model 1 and Model 2 is that in Model 2, a controller handles the
user request instead of another JSP. The controller is implemented as a Servlet. The following
steps are executed when the user submits the request.
1. The Controller Servlet handles the users request. (This means the hyperlink
in the JSP should point to the controller servlet).
2. The Controller Servlet then instantiates appropriate JavaBeans based on the
request parameters (and optionally also based on session attributes).
3. The Controller Servlet, either by itself or through a controller helper, communicates with
the middle tier or directly with the database to fetch the required data.
4. The Controller sets the resultant JavaBeans (either the same or a new one) in one
of the following contexts: request, session, or application.
The controller servlet is shown in the figure.
When the HTTP request arrives from the client, the Controller Servlet looks up in a
properties file to decide on the right Handler class for the HTTP request. This Handler class is
referred to as the Request Handler. The Request Handler contains the presentation logic for that
HTTP request including business logic invocation. In other words, the Request Handler does
everything that is needed to handle the HTTP request. The only difference so far from the
bare-bones MVC is that the controller servlet looks up a properties file to instantiate the
Handler instead of calling it directly.
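The properties-file lookup described here can be sketched with java.util.Properties and reflection; the property key and handler class below are illustrative, not from the report.

```java
import java.util.Properties;

// Sketch of the controller's lookup: the request path is mapped by a
// properties file to a Request Handler class name, which is instantiated
// by reflection instead of being called directly.
public class HandlerLookup {
    public interface RequestHandler { String handle(); }

    // Illustrative handler; a real one would hold the presentation logic
    // for its HTTP request, including business logic invocation.
    public static class ArticleHandler implements RequestHandler {
        public String handle() { return "article"; }
    }

    static RequestHandler lookup(Properties mappings, String path) {
        try {
            String className = mappings.getProperty(path);
            return (RequestHandler) Class.forName(className)
                    .getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("No handler for " + path, e);
        }
    }
}
```

The benefit is exactly the one the text names: adding a new request type means adding a properties entry and a handler class, without touching the controller.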
In Struts, there is only one controller servlet for the entire web application. This
controller servlet is called ActionServlet and resides in the package org.apache.struts.action.
It intercepts every client request and populates an ActionForm from the HTTP request
parameters. ActionForm is a normal JavaBeans class. It has several attributes corresponding to
the HTTP request parameters and getter, setter methods for those attributes. You have to create
your own ActionForm for every HTTP request handled through the Struts framework by
extending the org.apache.struts.action.ActionForm class.
For the lack of better terminology, let us coin a term to describe classes such as ActionForm:
View Data Transfer Object. A View Data Transfer Object is an object that holds the data from an
HTML page and transfers it around in the web-tier framework and application classes.
The ActionServlet then instantiates a Handler. The Handler class name is obtained from
an XML file based on the URL path information. This XML file is referred to as the Struts
configuration file and is by default named struts-config.xml.
Now, that was Struts in a nutshell. Struts is of course more than just this. It is a full-fledged
presentation framework. It aids in rapid development of web applications by separating the
concerns in projects. For instance, it has custom tags for JSPs. The page author can concentrate
on developing the JSPs using custom tags that are specified by the framework, while the
application developer works on creating the server-side representation of the data and its
interaction with a back-end data repository; throughout the development of the application, both
the page author and the developer need to coordinate and ensure that any changes to one area
are appropriately handled in the other. Further, Struts offers a consistent way of handling user
input and processing it.
Parameter                      Description
RAM                            500MB-1GB
Hard Disk                      120GB-160GB
Java Development Kit Version   JDK 1.5
Database                       MySQL
Database Front End             HeidiSQL / Toad for MySQL
Tool for Java Development      Eclipse
Front End Technology           JSP
Framework                      Spring Framework
Server                         Tomcat 8.0
Parameter Name                                  Parameter Value
Development Language                            Java
Java Development Kit Version                    JDK 1.6
Java Run Time Environment                       JRE 6
Database for Routing Tables Backend             MySQL
Database Front End for Routing Tables           HeidiSQL
Database Front End for exporting Excel Sheets   Toad for MySQL
Development Tool                                Eclipse
Server Type                                     Web Server
Web Server                                      Tomcat 6.0
Framework Used                                  Struts Framework
View Technology Used                            Java Server Pages
Designing                                       Cascading Style Sheets
Performance requirements
  time/space bounds: workloads, response time, throughput, and available storage space
    e.g. "the system must handle 1,000 transactions per second"
  reliability: the availability of components; integrity of information maintained and
  supplied to the system
    e.g. "the system must have less than 1 hr of downtime per three months"
  security
    e.g. permissible information flows, or who can do what
  survivability
    e.g. the system will need to survive fire, natural catastrophes, etc.
Operating requirements
  physical constraints (size, weight)
  personnel availability & skill level
8. Summary
This chapter described the Software Requirements Specification: the operating environment
(hardware and software requirements), functional requirements, non-functional requirements,
user characteristics, applications of the project, and advantages of the system.
Chapter 3
Fig: DFD Level 0 — Articles → Identify Syntax and Semantic Relations
Fig: DFD Level 1 — Article Submission → Data Cleaning → Tokenization → Syntactic
Relation → Semantic Relation → Score and Similarity Obtained
Fig: DFD Level 2 — Articles → Data Cleaning (stop words) → Tokenization → Find all
Syntax Relations (verb, adverb, and adjective) → Find all Semantic Relations → Score and
Similarity Obtained
3.6 Activity Diagram
Fig: MVC architecture — HTML/JSP view, web.xml and -servlet.xml configuration, Controller,
Model, and Database
Model
This is the Plain Old Java Object which has the getters and setters; the setters get called
automatically, and the data the user has entered becomes available.
Controller
This is the class which is used to fetch the user-entered data, process it, call the delegate
layer, and obtain the results.
Delegate
Service
This is the layer which is responsible for the entire algorithmic implementation; it contains
the heavyweight implementation of all the algorithms. Further, the service may require the
help of the Data Access Layer for some operations, as well as many other helper classes.
Database
This is the place where all the tables are placed.
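A minimal sketch of this layering (Controller → Delegate → Service); the class and method names are illustrative, and a real service would also call the Data Access Layer for database operations.

```java
// Sketch of the layers described above: each layer only talks to the one
// below it, so the algorithmic code stays isolated in the Service layer.
public class Layers {
    static class Service {
        // Stands in for the heavyweight algorithmic implementation.
        String compute(String input) { return "score(" + input + ")"; }
    }

    static class Delegate {
        private final Service service = new Service();
        String fetchResult(String input) { return service.compute(input); }
    }

    static class Controller {
        private final Delegate delegate = new Delegate();
        // Fetches the user-entered data, processes it, and calls the delegate.
        String handle(String userInput) {
            return delegate.fetchResult(userInput.trim());
        }
    }
}
```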
Fig: Sequence diagram — User → Article Submission → Data Cleaning (stop words) →
Tokenizing → Syntactic Relation → Semantic Relation → Score Computation
Chapter 4
Detailed design
4.1. Purpose
Detailed design is a phase wherein the internal logic of each of the modules specified in the
high-level design is determined. In this phase the details and algorithmic design of each of the
modules are specified. Other low-level components and subcomponents are also described. Each
subsection of this section will refer to or contain a detailed description of a system software
component. This chapter also discusses the control flow in the software, with much more detail
about the software modules, by explaining each of the functionalities.
This chapter presents the following:
View HTML/JSP
Action Form
Action
1. View - this is the location in which the user enters the data and performs some action.
2. Action Form - this is the POJO which contains the variables as defined in the view, with
the setter and getter methods. The data gets automatically bound.
3. Action - this is a class which contains the execute method, which is responsible for
handling the logic of the project and for delegating the result to an appropriate view. It
also makes use of helper methods to perform business logic.
Graphical user interfaces, such as Microsoft Windows and the one used by the Apple
Macintosh, feature the following basic components:
pointer: A symbol that appears on the display screen and that you move
to select objects and commands. Usually, the pointer appears as a small angled
arrow. Text-processing applications, however, use an I-beam pointer that is shaped like a
capital I.
pointing device : A device, such as a mouse or trackball, that enables you to select
objects on the display screen.
icons : Small pictures that represent commands, files, or windows. By moving the
pointer to the icon and pressing a mouse button, you can execute a command
or convert the icon into a window. You can also move the icons around the display screen
as if they were real objects on your desk.
desktop : The area on the display screen where icons are grouped is often referred to
as the desktop because the icons are intended to represent real objects on a real desktop.
windows: You can divide the screen into different areas. In each window, you
can run a different program or display a different file. You can move windows around the
display screen, and change their shape and size at will.
menus : Most graphical user interfaces let you execute commands by selecting a
choice from a menu.
The graphical user interface is developed using HTML and JSP.
JavaServer Pages Technology
JavaServer Pages (JSP) technology allows you to easily create web content that has both
static and dynamic components. JSP technology makes available all the dynamic capabilities of
Java Servlet technology but provides a more natural approach to creating static content. The
main features of JSP technology are as follows:
1. A language for developing JSP pages, which are text-based documents that describe how
to process a request and construct a response
2.
This chapter presents the following
View HTML/JSP
Model
Controller
4. View - this is the location in which the user enters the data and performs some action.
5. Model - this is the POJO which contains the variables as defined in the view, with the
setter and getter methods. The data gets automatically bound.
6. Controller - this is a class which contains the execute method, which is responsible for
handling the logic of the project and for delegating the result to an appropriate view. It
also makes use of helper methods to perform business logic.
The detailed design can be described using the following flowcharts.
Article Module
The article module is responsible for the storage of articles. The article name and article
description act as its input.
Fig: Article storage flowchart — Start → check the article name against the article name
list → if present (YES), validation error; otherwise (NO), storage of the article is successful
Fig: Data cleaning — unclean data → data extraction using a delimiter and data cleaning
using stop words (stop words repository) → clean data repository
Fig: Data cleaning flowchart — Start → read the website URL → extract the individual
words with the help of a delimiter such as a comma or a space → clean the symbols and remove
any word that belongs to the stop word list → Stop
Module 3 - Tokenization
After the data is cleaned, the token extraction process begins, in which all the words in the
articles, referred to as tokens, are extracted. The token extraction is again done with the help
of delimiters. The flowchart for the token extraction is given below.
Fig: Token extraction — clean data → data extraction using a delimiter → keywords
repository
Fig: Token extraction flowchart — Start → read the clean data → extract the individual
words with the help of a delimiter such as a comma or a space → while i <= number of tokens,
store token i and set i = i + 1 → Stop
The figure shows the token extraction process, where the clean data is scanned to obtain
keywords known as tokens, and these tokens are then stored in the repository.
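The token extraction described above can be sketched in Java, splitting the clean data on comma or space delimiters and assigning each distinct keyword a sequential ID; the ID scheme (sequential, starting at 1) is illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the Tokenization module: split the clean data on delimiters
// (comma or space) and assign each distinct keyword a unique ID.
public class Tokenizer {
    static Map<String, Integer> tokenize(String cleanData) {
        Map<String, Integer> ids = new LinkedHashMap<>();
        for (String token : cleanData.split("[,\\s]+")) {
            if (!token.isEmpty() && !ids.containsKey(token)) {
                ids.put(token, ids.size() + 1); // IDs start at 1, in order of first sight
            }
        }
        return ids;
    }
}
```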