Efficient Top-K Retrieval On Massive Data

Abstract:
Top-k query is an important operation that returns a set of interesting points from
a potentially huge data space. This paper shows that the existing algorithms cannot
process top-k queries on massive data efficiently, and proposes a novel
table-scan-based algorithm, T2S, to compute top-k results on massive data
efficiently. T2S first constructs the presorted table, whose tuples are arranged in
the order of the round-robin retrieval on the sorted lists. T2S maintains only a
fixed number of tuples to compute results. The early termination checking for T2S
is presented, along with an analysis of the scan depth. Selective retrieval is
devised to skip the tuples in the presorted table that are not top-k results. The
theoretical analysis proves that selective retrieval can significantly reduce the
number of retrieved tuples. Construction and incremental-update/batch-processing
methods for the data structures used are also proposed.
Introduction:
Top-k query is an important operation that returns a set of interesting points
from a potentially huge data space. In a top-k query, a ranking function F is
provided to determine the score of each tuple, and the k tuples with the largest
scores are returned. Due to its practical importance, top-k query has attracted
extensive attention. This paper proposes a novel table-scan-based algorithm, T2S
(Top-k by Table Scan), to compute top-k results on massive data efficiently.
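The definition of a top-k query given above can be illustrated with a minimal Python sketch. It assumes the tuples fit in memory and that the caller supplies the ranking function F; the names `top_k` and `score` are illustrative and are not taken from the paper. This is a naive full scan used only as a baseline, not the T2S algorithm.

```python
import heapq
from typing import Callable, Iterable, List, Sequence

Row = Sequence[float]


def top_k(rows: Iterable[Row], score: Callable[[Row], float], k: int) -> List[Row]:
    """Return the k rows with the largest score (naive full scan)."""
    # heapq.nlargest keeps at most k candidates in memory while scanning.
    return heapq.nlargest(k, rows, key=score)


if __name__ == "__main__":
    data = [(1, 5), (4, 4), (9, 0), (2, 2)]
    # Ranking function F: sum of the two attributes (an assumption for the demo).
    print(top_k(data, lambda r: r[0] + r[1], 2))
```

Every tuple is scored once, so the cost grows linearly with the table size, which is what index-based, sorted-list-based and table-scan-based methods try to improve on.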
The analysis of scan depth in T2S is also developed. Because the result size k is
usually small and the vast majority of the tuples retrieved in the presorted table
(PT) are not top-k results, this paper devises selective retrieval to skip the
tuples in PT that are not query results. The theoretical analysis proves that
selective retrieval can significantly reduce the number of retrieved tuples.
The construction and incremental-update/batch-processing methods for the data
structures are proposed in this paper. Extensive experiments are conducted on
synthetic and real-life data sets.
Existing System:
Due to its practical importance, top-k query has attracted extensive attention.
The existing top-k algorithms can be classified into three types: index-based
methods, view-based methods, and sorted-list-based methods. Index-based methods
(or view-based methods) make use of pre-constructed indexes or views to
process top-k queries.
A concrete index or view is constructed on a specific subset of attributes, so a
number of indexes or views that is exponential in the number of attributes has to
be built to cover the actual queries, which is prohibitively expensive. In
practice, the indexes or views can only be built on a small and selective set of
attribute combinations.
Sorted-list-based methods retrieve the sorted lists in a round-robin fashion,
maintain the retrieved tuples, and update their lower-bound and upper-bound scores.
When the kth largest lower-bound score is not less than the upper-bound scores of
the other candidates, the k candidates with the largest lower-bound scores are the
top-k results.
Sorted-list-based methods compute top-k results by retrieving the involved
sorted lists and can naturally support the actual queries. However, this paper
shows that the numbers of tuples retrieved and maintained in these methods
increase exponentially with the number of attributes, and polynomially with the
number of tuples and the result size.
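The round-robin retrieval and the lower-bound/upper-bound stopping rule described above can be sketched as follows. This is a simplified illustration in the spirit of sorted-list-based processing, not the exact algorithm of any cited paper. It assumes that the ranking function is the sum of the attributes, that attribute values are non-negative, and that every tuple appears once in each list; the name `sorted_list_top_k` is hypothetical.

```python
from typing import Dict, Hashable, List, Sequence, Tuple

# One sorted list per attribute: (tuple_id, value) pairs in descending value order.
SortedList = Sequence[Tuple[Hashable, float]]


def sorted_list_top_k(lists: Sequence[SortedList], k: int) -> List[Hashable]:
    """Round-robin scan with lower/upper bounds (sum of non-negative attributes)."""
    if k <= 0 or not lists:
        return []
    m = len(lists)
    seen: Dict[Hashable, Dict[int, float]] = {}  # tuple_id -> {list index: value}
    last = [0.0] * m  # last value read from each list

    def lower(tid: Hashable) -> float:
        # Unseen attributes contribute at least 0 (values are non-negative).
        return sum(seen[tid].values())

    def upper(tid: Hashable) -> float:
        # Unseen attributes are at most the last value read from that list.
        return sum(seen[tid].get(j, last[j]) for j in range(m))

    for depth in range(len(lists[0])):
        for j in range(m):  # one round-robin step over all lists
            tid, value = lists[j][depth]
            seen.setdefault(tid, {})[j] = value
            last[j] = value
        if len(seen) < k:
            continue
        ranked = sorted(seen, key=lower, reverse=True)
        kth_lower = lower(ranked[k - 1])
        others_ok = all(upper(t) <= kth_lower for t in ranked[k:])
        unseen_ok = sum(last) <= kth_lower  # bound for tuples not seen at all
        if others_ok and unseen_ok:
            return ranked[:k]
    return sorted(seen, key=lower, reverse=True)[:k]
```

Note that every retrieved tuple stays in `seen` until termination, which is the maintenance cost the text above refers to.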
Disadvantages:
Computational overhead is high.
Data redundancy is high.
The process is time-consuming.
Problem Definition:
Ranking is a central part of many information retrieval problems, such
as document retrieval, collaborative filtering, sentiment analysis, computational
advertising (online ad placement).
Training data consists of queries and documents matching them, together with the
relevance degree of each match. It may be prepared manually by human assessors
(or raters, as Google calls them), who check results for some queries and
determine the relevance of each result. It is not feasible to check the relevance
of all documents, so typically a technique called pooling is used: only the top
few documents retrieved by some existing ranking models are checked.
Typically, users expect a search query to complete in a short time (such as a few
hundred milliseconds for web search), which makes it impossible to evaluate a
complex ranking model on each document in the corpus, and so a two-phase
scheme is used.
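The two-phase scheme mentioned above can be sketched as follows: a cheap scoring function first selects a small candidate set, and a more expensive model then re-ranks only those candidates. The function and parameter names (`two_phase_rank`, `cheap_score`, `full_score`, `n_candidates`) are assumptions made for this illustration.

```python
import heapq
from typing import Callable, Iterable, List, TypeVar

Doc = TypeVar("Doc")


def two_phase_rank(
    docs: Iterable[Doc],
    cheap_score: Callable[[Doc], float],
    full_score: Callable[[Doc], float],
    n_candidates: int,
    k: int,
) -> List[Doc]:
    """Phase 1: keep n_candidates by a cheap score. Phase 2: re-rank them."""
    candidates = heapq.nlargest(n_candidates, docs, key=cheap_score)
    # The expensive model is evaluated on the candidates only.
    return heapq.nlargest(k, candidates, key=full_score)
```

The result is approximate by design: a document that the cheap score ranks below the cut-off is never seen by the full model, even if the full model would have ranked it first.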
Literature Survey
1) Best Position Algorithms for Top-k Query Processing in Highly Distributed Environments
Efficient top-k query processing in highly distributed environments is useful but
challenging. This paper focuses on the problem over vertically partitioned data and
aims to propose efficient algorithms with lower communication cost. Two new
algorithms, DBPA and BulkDBPA, are proposed in this paper. DBPA is a direct
extension of the centralized algorithm BPA2 into distributed environments.
Absorbing the advantage of low data access of BPA2, DBPA has the advantage of
low data transfer, though it requires a lot of communication round trips which
greatly affect the response time of the algorithm. BulkDBPA improves DBPA by
utilizing bulk read and bulk transfer mechanism which can significantly reduce its
round trips. Experimental results show that DBPA and BulkDBPA require much
less data transfer than SA and TPUT, and BulkDBPA outperforms the other
algorithms on overall performance. We also analyze the effect of different
parameters on query performance of BulkDBPA and especially investigate the
setting strategies of the bulk size.
DISADVANTAGES:
1. The Computation Overhead is greatly affected by the size of dictionary and
the number of documents, and almost has no relation to the number of query
keywords
2) Supporting early pruning in top-k query processing on massive data
This paper analyzes the execution behavior of No Random Accesses (NRA) and
determines the depths to which each sorted file is scanned in growing phase and
shrinking phase of NRA respectively. The analysis shows that NRA needs to
maintain a large quantity of candidate tuples in growing phase on massive data.
Based on the analysis, this paper proposes a novel top-k algorithm, Top-K with
Early Pruning (TKEP), which performs early pruning in growing phase.
General rule and mathematical analysis for early pruning are presented in this
paper. The theoretical analysis shows that early pruning can prune most of the
candidate tuples. Although TKEP is an approximate method to obtain the top-k
result, the probability for correctness is extremely high. Extensive experiments
show that TKEP has a significant advantage over NRA.
DISADVANTAGES:
1. It significantly limits the usability of outsourced data due to the difficulty of
searching over the encrypted data.
3) Efficient skyline computation on big data
Skyline is an important operation in many applications to return a set of interesting
points from a potentially huge data space. Given a table, the operation finds all
tuples that are not dominated by any other tuples. It is found that the existing
algorithms cannot process skyline on big data efficiently. This paper presents a
novel skyline algorithm SSPL on big data. SSPL utilizes sorted positional index
lists which require low space overhead to reduce I/O cost significantly. The sorted
positional index list Lj is constructed for each attribute Aj and is arranged in
ascending order of Aj. SSPL consists of two phases. In phase 1, SSPL computes
scan depth of the involved sorted positional index lists. During retrieving the lists
in a round-robin fashion, SSPL performs pruning on any candidate positional index
to discard the candidate whose corresponding tuple is not skyline result. Phase 1
ends when there is a candidate positional index seen in all of the involved lists. In
phase 2, SSPL exploits the obtained candidate positional indexes to
get skyline results by a selective and sequential scan on the table.
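The dominance test that underlies the skyline operation can be illustrated with a short sketch. It uses a simple quadratic comparison, not the SSPL algorithm summarized above, and it assumes that smaller values are preferred on every attribute; the function names are illustrative.

```python
from typing import List, Sequence

Point = Sequence[float]


def dominates(p: Point, q: Point) -> bool:
    """p dominates q if p is no worse on all attributes and better on at least one."""
    no_worse = all(a <= b for a, b in zip(p, q))
    strictly_better = any(a < b for a, b in zip(p, q))
    return no_worse and strictly_better


def skyline(points: Sequence[Point]) -> List[Point]:
    """Return the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

Each point is compared against all others, so the cost is quadratic in the number of tuples; pruning on sorted positional index lists, as in SSPL, is meant to avoid most of these comparisons.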
DISADVANTAGES:
1) It cannot achieve better efficiency
4) Efficient processing of exact top-k queries over disk-resident sorted lists

The top-k query is employed in a wide range of applications to generate a ranked
list of data that have the highest aggregate scores over certain attributes. As the
pool of attributes for selection by individual queries may be large, the data are
indexed with per-attribute sorted lists, and a threshold algorithm (TA) is applied on
the lists involved in each query. The TA executes in two phases--find a cut-off
threshold for the top-k result scores, then evaluate all the records that could score
above the threshold. In this paper, we focus on exact top-k queries that involve
monotonic linear scoring functions over disk-resident sorted lists. We introduce a
model for estimating the depths to which each sorted list needs to be processed in
the two phases, so that (most of) the required records can be fetched efficiently
through sequential or batched I/Os. We also devise a mechanism to quickly rank
the data that qualify for the query answer and to eliminate those that do not, in
order to reduce the computation demand of the query processor.
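For orientation, the classic threshold algorithm with random access can be sketched as below. This is the textbook formulation, not the two-phase depth-estimation model of the surveyed paper. It assumes a linear scoring function with non-negative weights, in-memory lists sorted in descending order, and a `table` dictionary that stands in for random access to full records; all names are hypothetical.

```python
import heapq
from typing import Dict, Hashable, List, Sequence, Tuple


def threshold_algorithm(
    lists: Sequence[Sequence[Tuple[Hashable, float]]],
    table: Dict[Hashable, Sequence[float]],
    weights: Sequence[float],
    k: int,
) -> List[Tuple[Hashable, float]]:
    """Classic TA: sorted access on each list, random access via `table`."""
    if k <= 0 or not lists:
        return []

    def score(row: Sequence[float]) -> float:
        return sum(w * x for w, x in zip(weights, row))

    scores: Dict[Hashable, float] = {}
    for depth in range(len(lists[0])):
        frontier = []  # value read at this depth from every list
        for lst in lists:
            tid, value = lst[depth]
            frontier.append(value)
            if tid not in scores:
                scores[tid] = score(table[tid])  # random access to the full record
        threshold = score(frontier)  # best score any unseen record could reach
        best = heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
        if len(best) == k and best[-1][1] >= threshold:
            return best
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
```

The scan stops as soon as the kth best exact score reaches the threshold, because monotonicity guarantees that no unseen record can score higher.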
DISADVANTAGES:
1. The Computation Overhead is greatly affected by the size of dictionary and
the number of documents, and almost has no relation to the number of query
keywords
Proposed System:
The proposed system draws on layered indexing, which organizes the tuples into
multiple consecutive layers, so that the top-k results can be computed from at
most k layers of tuples. A layer-based Pareto-Based Dominant Graph expresses the
dominance relationship between records, and top-k query is implemented as a graph
traversal problem.
A dual-resolution layer structure is also considered: a top-k query can be
processed efficiently by traversing the dual-resolution layer through the
relationships between tuples. The Hybrid-Layer Index integrates layer-level
filtering and list-level filtering to significantly reduce the number of tuples
retrieved in query processing. View-based algorithms pre-construct the specified
materialized views according to some ranking functions.
Given a top-k query, one or more optimal materialized views are selected to
return the top-k results efficiently. LPTA+ significantly improves the efficiency
of the state-of-the-art LPTA algorithm: because the materialized views are cached
in memory, LPTA+ can reduce the iterative calling of the linear programming
sub-procedure. In practical applications, a concrete index (or view) is built on a
specific subset of attributes. Because covering all attribute combinations is
prohibitively expensive, the indexes (or views) can only be built on a small and
selective set of attribute combinations.
If the attribute combinations of top-k queries are fixed, index-based or view-based
methods can provide superior performance. However, on massive data, users often
issue ad-hoc queries. It is very likely that the indexes (or views) involved in
the ad-hoc queries have not been built, which greatly limits the practicability of
these methods.
By contrast, T2S only builds the presorted table, on which a top-k query on any
attribute combination can be processed. This reduces the space overhead
significantly compared with index-based (or view-based) methods and makes T2S
practical.
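One way to read the presorted-table idea is sketched below. It is an interpretation under stated assumptions, not the authors' implementation: tuples are emitted in the order in which a round-robin retrieval over the per-attribute sorted lists first meets them, and a query on any attribute subset then scans the table sequentially while keeping only k candidates. Early termination and selective retrieval are deliberately left out. The names `build_presorted_table` and `scan_top_k` are hypothetical.

```python
import heapq
from typing import Dict, Hashable, List, Sequence, Tuple

Row = Sequence[float]


def build_presorted_table(table: Dict[Hashable, Row]) -> List[Tuple[Hashable, Row]]:
    """Order tuples by their first appearance in a round-robin scan of sorted lists."""
    if not table:
        return []
    m = len(next(iter(table.values())))
    # One list of tuple ids per attribute, in descending attribute value.
    lists = [
        sorted(table, key=lambda tid, j=j: table[tid][j], reverse=True)
        for j in range(m)
    ]
    order, seen = [], set()
    for depth in range(len(table)):
        for j in range(m):
            tid = lists[j][depth]
            if tid not in seen:
                seen.add(tid)
                order.append(tid)
    return [(tid, table[tid]) for tid in order]


def scan_top_k(pt: List[Tuple[Hashable, Row]], attrs: Sequence[int], k: int):
    """Sequential scan of the presorted table that keeps at most k candidates."""
    heap: list = []  # min-heap of (score, -position, tuple_id)
    for pos, (tid, row) in enumerate(pt):
        item = (sum(row[j] for j in attrs), -pos, tid)
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            heapq.heapreplace(heap, item)
    return [(tid, s) for s, _, tid in sorted(heap, reverse=True)]
```

The same table serves queries on `[0]`, `[1]` or `[0, 1]`, which is the point made above about supporting any attribute combination with a single structure.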
Advantages:
The evaluation of an information retrieval system is the process of assessing
how well a system meets the information needs of its users.
Traditional evaluation metrics, designed for Boolean retrieval or top-k
retrieval, include precision and recall.
All common measures described here assume a ground truth notion of
relevancy: every document is known to be either relevant or non-relevant to
a particular query.
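Precision and recall under a binary ground truth can be computed as in the following sketch. The function name and the convention of returning 0.0 for an empty denominator are choices made for this illustration.

```python
from typing import Hashable, Iterable, Tuple


def precision_recall(
    retrieved: Iterable[Hashable], relevant: Iterable[Hashable]
) -> Tuple[float, float]:
    """Precision = hits / retrieved, recall = hits / relevant."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall
```

For a top-k result list, passing the k returned identifiers as `retrieved` gives precision at k.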
OVERVIEW OF MICROSOFT.NET
.NET represents Microsoft's vision of the future of applications in the Internet
age. .NET provides enhanced interoperability features based upon open Internet
standards. Microsoft .NET represents a great improvement.
Microsoft .NET provides the following:
A robust runtime platform, the CLR
Multiple language development
An extensible programming model, the .NET Framework, which provides a
large class library of reusable code available from multiple languages
A networking infrastructure built on top of Internet standards that supports a
high level of communication among applications
A new mechanism of application delivery, the Web service, that supports the
concept of an application as a service
Powerful development tools
.NET Framework Overview
The .NET Framework consists of the CLR, the .NET Framework Class Library, the
Common Language Specification (CLS), a number of .NET languages, and Visual
Studio .NET.
Common Language Runtime

The runtime environment provided by .NET, the CLR, manages the execution of
code and provides useful services. The services of the CLR are exposed through
programming languages. The syntax for these services varies from language to
language, but the underlying execution engine providing the services is the same.
Not all languages expose all the features of the CLR. The language with the best
mapping to the CLR is the new language C#. VB.NET, however, does an
admirable job of exposing the functionality.
.NET Framework Class Library
The .NET Framework class library is huge, comprising more than 2,500 classes.
All this functionality is available to all the .NET languages. The library consists of
four main parts:
1. Base class library (which includes networking, security, diagnostics, I/O, and
other types of operating system services)
2. Data and XML classes
3. Windows UI
4. Web services and Web UI
Common Language Specification
The CLS is an agreement among language designers and class library designers
about those features and usage conventions that can be relied upon. CLS rules
apply to public features that are visible outside the assembly where they are
defined.
Languages in .NET
Microsoft itself is providing four CLS-compliant languages. VB.NET, C#, and C++
with managed extensions are extenders. JScript .NET is a consumer.
Visual Studio .NET 2008

Visual Studio .NET 2008 includes a range of new features and enhancements for
every type of developer, and offers key improvements directed at mobile device
developers and enterprise developers.
Base classes provide standard functionality such as input/output, string
manipulation, security management, network communications, thread
management, text management, and user interface design features.

The ADO.NET classes enable developers to interact with data accessed in the form
of XML through the OLE DB, ODBC, Oracle, and SQL Server interfaces. The
ASP.NET classes support the development of Web-based applications and Web
services. The Windows Forms classes support the development of desktop-based
smart client applications.
ASP.NET
ASP.NET is a programming framework built on the common language runtime that
can be used on a server to build powerful Web applications. ASP.NET offers
several important advantages over previous Web development models:
Enhanced Performance
ASP.NET is compiled common language runtime code running on the server.
Unlike its interpreted predecessors, ASP.NET can take advantage of early binding,
just-in-time compilation, native optimization, and caching services right out of the
box. This amounts to dramatically better performance before you ever write a line
of code.
World-Class Tool Support
The ASP.NET framework is complemented by a rich toolbox and designer in the
Visual Studio integrated development environment. WYSIWYG editing,
drag-and-drop server controls, and automatic deployment are just a few of the
features this powerful tool provides.
Power and Flexibility
Because ASP.NET is based on the common language runtime, the power and
flexibility of that entire platform is available to Web application developers. The
.NET Framework class library, Messaging, and Data Access solutions are all
seamlessly accessible from the Web. ASP.NET is also language-independent, so
you can choose the language that best applies to your application or partition your
application across many languages.
Simplicity
ASP.NET makes it easy to perform common tasks, from simple form submission
and client authentication to deployment and site configuration. For example, the
ASP.NET page framework allows you to build user interfaces that cleanly separate
application logic from presentation code and to handle events in a simple, Visual
Basic-like forms processing model. Additionally, the common language runtime
simplifies development, with managed code services such as automatic reference
counting and garbage collection.
Manageability
ASP.NET employs a text-based, hierarchical configuration system, which
simplifies applying settings to your server environment and Web applications.
Because configuration information is stored as plain text, new settings may be
applied without the aid of local administration tools. This "zero local
administration" philosophy extends to deploying ASP.NET Framework
applications as well. An ASP.NET Framework application is deployed to a server
simply by copying the necessary files to the server.
Scalability and Availability
ASP.NET has been designed with scalability in mind, with features specifically
tailored to improve performance in clustered and multiprocessor environments.
Further, processes are closely monitored and managed by the ASP.NET runtime, so
that if one misbehaves (leaks, deadlocks), a new process can be created in its place,
which helps keep your applications constantly available to handle requests.
Customizability and Extensibility
ASP.NET delivers a well-factored architecture that allows developers to "plug in"
their code at the appropriate level. In fact, it is possible to extend or replace any
subcomponent of the ASP.NET runtime with your own custom-written component.
Security
With built-in Windows authentication and per-application configuration, you can
be assured that your applications are secure.
Language Support
The Microsoft .NET Platform currently offers built-in support for three languages:
C#, Visual Basic, and JScript.
Language Compatibility
The differences between the VBScript used in ASP and the Visual Basic .NET
language used in ASP.NET are by far the most extensive of all the potential
migration issues. Not only has ASP.NET departed from the VBScript language to
"true" Visual Basic, but the Visual Basic language itself has undergone significant
changes in this release.
TOOL SELECTED: VB.NET
Visual Basic.Net is designed to be a fast and easy way to create .NET applications,
including Web services and ASP.NET Web applications. Applications written in
Visual Basic are built on the services of the common language runtime and take
full advantage of the .NET Framework.
Visual Basic .NET (VB.NET) is an object-oriented computer language that can be
viewed as an evolution of Microsoft's Visual Basic (VB) implemented on the
Microsoft .NET framework. Its introduction has been controversial, as significant
changes were made that broke backward compatibility with VB and caused a rift
within the developer community.
It is fully integrated with the .NET Framework and the common language
runtime, which together provide language interoperability, garbage collection,
enhanced security, and improved versioning support.
MICROSOFT SQL SERVER 2005
SQL Server 2005 exceeds dependability requirements and provides innovative
capabilities that increase employee effectiveness, integrate heterogeneous IT
ecosystems, and maximize capital and operating budgets. SQL Server 2005
provides the enterprise data management platform your organization needs to adapt
quickly in a fast-changing environment. With the lowest implementation and
maintenance costs in the industry, SQL Server 2005 delivers rapid return on your
data management investment. SQL Server 2005 supports the rapid development of
enterprise-class business applications that can give your company a critical
competitive advantage.
Easy-to-Use Business Intelligence
Through rich data analysis and data mining capabilities that integrate with
familiar applications such as Microsoft Office, SQL Server 2005 enables you
to provide all of your employees with critical, timely business information tailored
to their specific information needs. Every copy of SQL Server 2005 ships with a
suite of BI services.
Self-Tuning and Management Capabilities
Revolutionary self-tuning and dynamic self-configuring features optimize database
performance, while management tools automate standard activities. Graphical tools
and wizards simplify setup, database design, and performance monitoring,
allowing database administrators to focus on meeting strategic business needs.
Data Management Applications and Services
Unlike its competitors, SQL Server 2005 provides a powerful and comprehensive
data management platform. Every software license includes extensive management
and development tools, a powerful extraction, transformation, and loading (ETL)
tool, business intelligence and analysis services, and new capabilities such as
Notification Services. The result is the best overall business value available.
SQL Server 2005 Enterprise Edition
Enterprise Edition includes the complete set of SQL Server data management and
analysis features and is uniquely characterized by several features that make it the
most scalable and available edition of SQL Server 2005. It scales to the
performance levels required to support the largest Web sites, Enterprise Online
Transaction Processing (OLTP) systems and Data Warehousing systems. Its
support for failover clustering also makes it ideal for any mission critical
line-of-business application.
Top 10 Features of SQL Server 2005

1. T-SQL (Transact-SQL) enhancements
T-SQL is the native set-based RDBMS programming language offering
high-performance data access. It now incorporates many new features including error
handling via the TRY and CATCH paradigm, Common Table Expressions (CTE),
which return a record set in a statement, and the ability to shift columns to rows
and vice versa with the PIVOT and UNPIVOT commands.
2. CLR (Common Language Runtime)
The next major enhancement in SQL Server 2005 is the integration of
.NET-compliant languages such as C# or VB.NET to build objects (stored
procedures, triggers, functions, etc.). This enables you to execute .NET code in the
DBMS to take advantage of the .NET functionality. It is expected to replace
extended stored procedures in the SQL Server 2000 environment as well as expand
the traditional relational engine capabilities.
3. Service Broker
The Service Broker handles messaging between a sender and receiver in a loosely
coupled manner. A message is sent, processed and responded to, completing the
transaction. This greatly expands the capabilities of data-driven applications to
meet workflow or custom business needs.
4. Data encryption
SQL Server 2000 had no documented or publicly supported functions to encrypt
data in a table natively. Organizations had to rely on third-party products to address
this need. SQL Server 2005 has native capabilities to support encryption of data
stored in user-defined databases.
5. SMTP mail
Sending mail directly from SQL Server 2000 is possible, but challenging. With
SQL Server 2005, Microsoft incorporates SMTP mail to improve the native mail
capabilities. Say "see-ya" to Outlook on SQL Server!
6. HTTP endpoints
You can easily create HTTP endpoints via a simple T-SQL statement exposing an
object that can be accessed over the Internet. This allows a simple object to be
called across the Internet for the needed data.
7. Multiple Active Result Sets (MARS)
MARS allow a persistent database connection from a single client to have more
than one active request per connection. This should be a major performance
improvement, allowing developers to give users new capabilities when working
with SQL Server. For example, it allows multiple searches, or a search and data
entry. The bottom line is that one client connection can have multiple active
processes simultaneously.
8. Dedicated administrator connection
If all else fails, stop the SQL Server service or push the power button. That
mentality is finished with the dedicated administrator connection. This
functionality will allow a DBA to make a single diagnostic connection to SQL
Server even if the server is having an issue.
9. SQL Server Integration Services (SSIS)
SSIS has replaced DTS (Data Transformation Services) as the primary ETL
(Extraction, Transformation and Loading) tool and ships with SQL Server free of
charge. This tool, completely rewritten since SQL Server 2000, now has a great
deal of flexibility to address complex data movement.
10. Database mirroring
Database mirroring is an extension of the native high-availability capabilities.
It was not expected to be released with SQL Server 2005 at the RTM in November,
but the feature has great potential.
INFORMATION SUPER HIGHWAY:
The Internet is a set of computer networks, made up of a large number of smaller
networks, using different networking protocols. It is the world's largest computing
network, consisting of over two million computers supporting over 20 million users
in almost 200 different countries. The Internet is growing at a phenomenal rate of
between 10 and 15 percent, so any size estimates are quickly out of date.
The Internet was originally established to meet the research needs of the U.S.
defence industry, but it has grown into a huge global network serving universities,
academic researchers, commercial interests and government agencies, both in the
U.S. and overseas. The Internet uses TCP/IP protocols, and many of the Internet
hosts run the Unix operating system.
HTML
HTML (Hyper Text Markup Language) is the language that is used to prepare
documents for online publications. HTML documents are also called Web
documents, and each HTML document is known as Web page.
A page is what is seen in the browser at any time. Each Web site, whether on the
Internet or Intranet, is composed of multiple pages. And it is possible to switch
among them by following hyperlinks. The collection of HTML pages makes up the
World Wide Web.
A web page is basically a text file that contains the text to be displayed and
references to elements such as images, sounds and, of course, hyperlinks to other
documents. HTML pages can be created using a simple text editor such as Notepad
or a WYSIWYG application such as Microsoft FrontPage.
In either case the result is a plain text file that computers can easily exchange. The
browser displays this text file on the client computer.
"Hypertext" is the jumping frog portion. A hyperlink can jump to any place within
your own page(s) or literally to anyplace in the world with a 'net address (URL, or
Uniform Resource Locator.) It's a small part of the html language.
INTERNET INFORMATION SERVER (IIS):

A web server is a program connected to the World Wide Web (WWW) that furnishes
resources to the web browser.
Microsoft IIS is a web server integrated with Windows NT Server that makes it
easy to publish information and bring business applications to the web.
Because of its tight integration with Windows NT Server, IIS gives the
network administrator and application developer the same security, networking
and administration functionality as Windows NT Server. Above and beyond its use
of familiar Windows NT Server tools and functionality, IIS also has built-in
capabilities to help administer secure websites and to develop server-intensive
web applications.
FEATURES OF IIS:
IIS provides integrated security and access to a wide range of content, works
seamlessly with COM components, and has a graphical interface, the Microsoft
Management Console (MMC), that you can use to create and manage your ASP
applications.
IIS Provides Integrated Security:

On the Internet, most sites allow anybody to connect to the site. The exceptions are
commercial sites, where you pay a one-time or monthly fee to access the site. Sites
that restrict access are called secured sites. Secured sites use either integrated
security or login/password security. IIS supports both of these methods.
IIS provides Access to Content:
All web servers can deliver HTML files, but they differ widely in how they treat
other types of content. Most servers let you add and modify Multipurpose Internet
Mail Extensions (MIME) types, but IIS integrates directly into the Windows registry.
That means IIS natively understands how to treat most common Windows file
formats, such as text (TXT) files, application initialization (INI) files, executable
(EXE) files and many others.
IIS provides an Interface for COM
You can control many parts of IIS using COM. IIS exposes many of the server's
configuration settings via the IIS Admin objects. These objects are accessible from
ASP and other languages. That means you can adjust server configuration and
create virtual directories and webs programmatically. IIS 4 and higher store
settings and web information in a special database called the Metabase. You can use
the IIS Admin objects to create new sites and virtual directories and to alter the
properties of existing sites and virtual directories.
IIS ARCHITECTURES OVERVIEW:
IIS is a core product, which means that it is designed to work closely with many
other products, including all products in the Windows NT Server 4.0 Option pack.
The following figure shows the relationship between IIS and other products
installed as part of the Windows NT Server 4.0 Option pack.
SECURITY FOR IIS APPLICATION
IIS provides three authentication schemes to control access to IIS resources:
Anonymous, Basic and Windows NT Challenge/Response. Each of these schemes
has a different effect on the security context of an application launched by IIS.
This includes ISAPI extension agents, CGI applications, IDC scripts and future
scripting capabilities.
ACCESS PRIVILEGES
IIS provides several new access levels. The following values can set the type of
access allowed to specific directories:
Read
Write
Script
Execute
Log Access
Directory Browsing.
IIS WEBSITE ADMINISTRATION
Administering websites can be time consuming and costly, especially for people
who manage large Internet Service Provider (ISP) installations. To save time and
money, ISPs support only large company web sites at the expense of personal
websites. But is there a cost-effective way to support both? The answer is yes, if
you can automate administrative tasks and let users administer their own sites from
remote computers. This solution reduces the amount of time and money it takes to
manually administer a large installation, without reducing the number of web sites
supported.
Microsoft Internet Information Server (IIS) version 4.0 offers technologies to do
this:
1. Windows Scripting Host (WSH)
2. IIS Admin objects built on top of Active Directory Service Interfaces (ADSI)
With these technologies working together behind the scenes, a person can
administer sites from the command line of a central computer and can group
frequently used commands in batch files. Then all users need to do is run batch
files to add new accounts, change permissions, add a virtual server to a site and
perform many other tasks.
SOFTWARE REQUIREMENT SPECIFICATION
A software requirements specification (SRS) is a complete description of the
behavior of the software to be developed. It includes a set of use cases that
describe all of the interactions that the users will have with the software. In
addition to use cases, the SRS contains functional requirements, which define the
internal workings of the software: that is, the calculations, technical details, data
manipulation and processing, and other specific functionality that shows how the
use cases are to be satisfied. It also contains nonfunctional requirements, which
impose constraints on the design or implementation (such as performance
requirements, quality standards or design constraints).
The SRS phase consists of two basic activities:
1) Problem/Requirement Analysis:
This process is the harder and more nebulous of the two. It deals with understanding
the problem, the goal and the constraints.
2) Requirement Specification:
Here, the focus is on specifying what has been found during analysis. Issues such as
representation, specification languages and tools, and checking the specifications
are addressed during this activity.
The Requirement phase terminates with the production of the validated SRS
document. Producing the SRS document is the basic goal of this phase.
Role of SRS:
The purpose of the Software Requirement Specification is to reduce the
communication gap between the clients and the developers. The Software Requirement
Specification is the medium through which the client and user needs are accurately
specified. It forms the basis of software development. A good SRS should satisfy
all the parties involved in the system.
Data flow diagram

Level 0 (New user registration): User registration; User details; Key generation; Encryption; Cloud Service Provider
Level 1 (User login): Verify user authentication; Output; User; Cloud Service Provider
Level 2 (User upload data into CSP): User upload data; CSP; User; Keyword search and key verification; Encrypt data & store into cloud
Level 3 (User retrieve data from CSP): User select data to retrieve; CSP; User; Key verification; Decrypt data & download
Use case Diagram

Actors: User; CSP
Use cases: User registration; Attribute based encryption; Key generation; Upload data into CSP; Encryption & store data into CSP; Keyword search; Key verification; Decrypt & download data
Class Diagram
Cloud service provider
Data owner
Csp name
User name
Company name
address
Contact details
Available space details
Attribute based key generation()

Maintain user details ()
Upload files()
Download files()
server
Maintain file details
Maintain key details
Verify key ()
Encrypt files()
Decrypt files()
Activity diagram
Modules:
Multi-keyword ranked search:
To design search schemes which allow multi-keyword query and provide
result similarity ranking for effective data retrieval, instead of returning
undifferentiated results.
Privacy-preserving:
To prevent the cloud server from learning additional information from the
data set and the index, and to meet privacy requirements. If the cloud server
deduces any association between keywords and encrypted documents from the index,
it may learn the major subject of a document, or even the content of a short document.
Therefore, the searchable index should be constructed to prevent the cloud server
from performing this kind of association attack.
Efficiency:
The above goals on functionality and privacy should be achieved with low
communication and computation overhead. Let xi denote the number of query keywords
appearing in a document. The final similarity score is a linear function of xi, where
the coefficient r is set as a positive random number. However, because the random
factor εi is introduced as a part of the similarity score, the final search result on the
basis of sorting similarity scores may not be as accurate as that in the original scheme.
For the consideration of search accuracy, we can let εi follow a normal distribution
where the standard deviation functions as a flexible tradeoff parameter between
search accuracy and security.
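The tradeoff described above can be illustrated with a small sketch. It is not the scheme's actual construction: the exact form of the linear function is not given here, so the code assumes a score of r * (x + e), where x is the number of matched query keywords, r is a positive coefficient and e is a random factor drawn from a normal distribution with mean 0 and standard deviation sigma. The function names noisy_scores and ranking are hypothetical.

```python
import random


def noisy_scores(match_counts, r, sigma, rng):
    """Return one obfuscated similarity score per document.

    Assumed form (not taken from the text): score = r * (x + e),
    where e ~ Normal(0, sigma) is the random factor.
    """
    return [r * (x + rng.gauss(0.0, sigma)) for x in match_counts]


def ranking(scores):
    """Return document indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])


if __name__ == "__main__":
    counts = [3, 1, 4, 2]  # matched query keywords per document
    # sigma = 0 adds no randomness, so the order follows the match counts.
    print(ranking(noisy_scores(counts, r=2.0, sigma=0.0, rng=random.Random(0))))
    # A larger sigma hides more about the true counts but may reorder results.
    print(ranking(noisy_scores(counts, r=2.0, sigma=2.0, rng=random.Random(0))))
```

With sigma set to 0 the ranking equals the exact ranking by matched keywords. Raising sigma makes the scores less revealing at the cost of ranking accuracy, which is the tradeoff the standard deviation controls.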
SYSTEM TESTING
The purpose of testing is to discover errors. Testing is the process of trying
to discover every conceivable fault or weakness in a work product. It provides a
way to check the functionality of components, sub-assemblies, assemblies and/or a
finished product. It is the process of exercising software with the intent of ensuring
that the software system meets its requirements and user expectations and does not
fail in an unacceptable manner. There are various types of tests. Each test type
addresses a specific testing requirement.
TYPES OF TESTS
Unit testing
Unit testing involves the design of test cases that validate that the internal
program logic is functioning properly, and that program inputs produce valid
outputs. All decision branches and internal code flow should be validated. It is the
testing of individual software units of the application. It is done after the
completion of an individual unit and before integration. This is structural testing
that relies on knowledge of the unit's construction and is invasive. Unit tests perform
basic tests at component level and test a specific business process, application, and/or
system configuration. Unit tests ensure that each unique path of a business process
performs accurately to the documented specifications and contains clearly defined
inputs and expected results.
Integration testing
Integration tests are designed to test integrated software components to
determine if they actually run as one program. Testing is event driven and is more
concerned with the basic outcome of screens or fields. Integration tests
demonstrate that although the components were individually satisfactory, as shown
by successful unit testing, the combination of components is correct and
consistent. Integration testing is specifically aimed at exposing the problems that
arise from the combination of components.
Functional test
Functional tests provide systematic demonstrations that functions tested are
available as specified by the business and technical requirements, system
documentation, and user manuals.
Functional testing is centered on the following items:
Valid Input: identified classes of valid input must be accepted.
Invalid Input: identified classes of invalid input must be rejected.
Functions: identified functions must be exercised.
Output: identified classes of application outputs must be exercised.
Systems/Procedures: interfacing systems or procedures must be invoked.
Organization and preparation of functional tests is focused on requirements, key
functions, or special test cases. In addition, systematic coverage pertaining to
identifying business process flows, data fields, predefined processes, and successive
processes must be considered for testing. Before functional testing is complete,
additional tests are identified and the effective value of current tests is determined.
System Test
System testing ensures that the entire integrated software system meets
requirements. It tests a configuration to ensure known and predictable results. An
example of system testing is the configuration oriented system integration test.
System testing is based on process descriptions and flows, emphasizing pre-driven
process links and integration points.
White Box Testing

White Box Testing is testing in which the software tester has
knowledge of the inner workings, structure and language of the software, or at least
its purpose. It is used to test areas that cannot be reached from a black
box level.
Black Box Testing

Black Box Testing is testing the software without any knowledge of the inner
workings, structure or language of the module being tested. Black box tests, like
most other kinds of tests, must be written from a definitive source document, such
as a specification or requirements document. It is testing in which the software
under test is treated as a black box: you cannot see into it. The test provides inputs
and responds to outputs without considering how the software works.
Unit Testing:
Unit testing is usually conducted as part of a combined code and unit test
phase of the software lifecycle, although it is not uncommon for coding and unit
testing to be conducted as two distinct phases.
Test strategy and approach

Field testing will be performed manually and functional tests will be written
in detail.
Test objectives
All field entries must work properly.
Pages must be activated from the identified link.
The entry screen, messages and responses must not be delayed.
Features to be tested
Verify that the entries are of the correct format
No duplicate entries should be allowed
All links should take the user to the correct page.
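The first two features listed above (correct entry format, no duplicates) can be checked with a small unit test. The sketch below is illustrative only and is not taken from the project code: the EntryRegistry class, the user-name pattern and the case-insensitive duplicate rule are all assumptions made for the example.

```python
import re

# Assumed format rule: a letter followed by 2 to 19 letters, digits or underscores.
ENTRY_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]{2,19}")


class EntryRegistry:
    """Hypothetical store that accepts well-formed, unique entries only."""

    def __init__(self):
        self._entries = set()

    def add(self, entry):
        if not ENTRY_PATTERN.fullmatch(entry):
            raise ValueError("invalid format")
        key = entry.lower()  # assumption: duplicates are case-insensitive
        if key in self._entries:
            raise ValueError("duplicate entry")
        self._entries.add(key)
        return True


def test_entries():
    registry = EntryRegistry()
    assert registry.add("alice_01")          # valid input is accepted
    for bad in ("1bad", "ab", "has space"):  # invalid input is rejected
        try:
            registry.add(bad)
        except ValueError:
            pass
        else:
            raise AssertionError("accepted invalid entry: " + bad)
    try:
        registry.add("Alice_01")             # duplicate is rejected
    except ValueError:
        pass
    else:
        raise AssertionError("accepted duplicate entry")


if __name__ == "__main__":
    test_entries()
    print("all entry checks passed")
```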
Integration Testing
Software integration testing is the incremental integration testing of two or
more integrated software components on a single platform to produce failures
caused by interface defects.
The task of the integration test is to check that components or software
applications, e.g. components in a software system or, one step up, software
applications at the company level, interact without error.
Test Results: All the test cases mentioned above passed successfully. No defects
were encountered.
Acceptance Testing
User Acceptance Testing is a critical phase of any project and requires
significant participation by the end user. It also ensures that the system meets the
functional requirements.
Test Results: All the test cases mentioned above passed successfully. No defects
were encountered.
Conclusion:
The proposed novel T2S algorithm was successfully implemented to
efficiently return top-k results on massive data by sequentially scanning the
presorted table, in which the tuples are arranged in the order of round-robin
retrieval on sorted lists. Only a fixed number of candidates needs to be maintained in
T2S. This paper proposes early termination checking and the analysis of the scan
depth. Selective retrieval is devised in T2S and it is analyzed that most of the
candidates in the presorted table can be skipped. The experimental results show
that T2S significantly outperforms the existing algorithms.
Future Enhancement:
Future development of the multi-keyword ranked search scheme should
explore checking the integrity of the rank order in the search result returned by the
untrusted network server infrastructure.
Feature Enhancement:
A novel table-scan-based T2S algorithm was implemented successfully to
compute top-k results on massive data efficiently. Given table T, T2S first presorts
T to obtain table PT (Presorted Table), whose tuples are arranged in the order of the
round-robin retrieval on the sorted lists. During its execution, T2S maintains only a
fixed and small number of tuples to compute results. It is proved that T2S has the
characteristic of early termination. It does not need to examine all tuples in PT to
return results.
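The table-scan idea summarized above can be sketched as follows. This is a simplified, in-memory illustration and not the authors' implementation. It assumes a linear ranking function with non-negative weights, and it uses a threshold-style stop test computed from the values reached in each sorted list after every round-robin round, which may differ from the early termination check used in T2S. Selective retrieval is not modeled. The names build_presorted_table and top_k_scan are hypothetical.

```python
import heapq


def build_presorted_table(table):
    """Arrange rows in the order of round-robin retrieval on the sorted lists.

    Returns one (new_rows, frontier) pair per round, where frontier holds the
    attribute values reached in each sorted list at that depth.
    """
    n, m = len(table), len(table[0])
    # One list of row ids per attribute, largest attribute value first.
    sorted_lists = [sorted(range(n), key=lambda i, j=j: -table[i][j])
                    for j in range(m)]
    seen, rounds = set(), []
    for depth in range(n):
        new_rows = []
        for j in range(m):
            row_id = sorted_lists[j][depth]
            if row_id not in seen:  # each row is stored once, at first sight
                seen.add(row_id)
                new_rows.append(table[row_id])
        frontier = tuple(table[sorted_lists[j][depth]][j] for j in range(m))
        rounds.append((new_rows, frontier))
    return rounds


def top_k_scan(rounds, weights, k):
    """Scan the presorted table sequentially, keeping only k candidates."""
    def score(row):
        return sum(w * v for w, v in zip(weights, row))

    heap, scanned = [], 0  # min-heap of (score, row) with at most k entries
    for new_rows, frontier in rounds:
        for row in new_rows:
            scanned += 1
            s = score(row)
            if len(heap) < k:
                heapq.heappush(heap, (s, row))
            elif s > heap[0][0]:
                heapq.heapreplace(heap, (s, row))
        # No unseen row can score above the frontier, so stop once the
        # k-th best candidate already reaches that bound.
        if len(heap) == k and heap[0][0] >= score(frontier):
            break
    return sorted(heap, reverse=True), scanned


if __name__ == "__main__":
    T = [(9, 1), (8, 8), (1, 9), (5, 5), (2, 3), (3, 2), (1, 1), (0, 4)]
    PT = build_presorted_table(T)
    print(top_k_scan(PT, weights=(1, 1), k=1))  # ([(16, (8, 8))], 3)
```

In this small example the top-1 result is found after scanning 3 of the 8 rows, and the heap never holds more than k candidates, which mirrors the two properties stated above: a fixed, small number of maintained tuples and early termination.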
