
Truth Discovery with Multiple Conflicting Information Providers on the Web By Sandesh A G

A PROJECT REPORT Submitted in partial fulfillment of the requirements for the award of the degree of MASTER OF TECHNOLOGY in Information Systems and Management

International School of Information Management University of Mysore June, 2012

BONAFIDE CERTIFICATE

Certified that this project report, "Truth Discovery with Multiple Conflicting Information Providers on the Web", is the bonafide work of Sandesh A G, who carried out the project work under our supervision. Certified further that, to the best of our knowledge, the work reported herein does not form part of any other project report or dissertation on the basis of which a degree or award was conferred on an earlier occasion on this or any other candidate.

Signature Prof. Shalini R. Urs Executive Director International School of Information Management University of Mysore Mysore, India

Signature Mr. Mohammed Kaleem Project Guide

Signature Prof. Ramsesha Mudigere Supervisor International School of Information Management University of Mysore Mysore, India

ABSTRACT

In today's era the World Wide Web has become the most important source of information for most people, yet there is no guarantee that the information you get on the Web is true. For example, if we search the Web for the height of Mount Everest, we end up with different answers from different websites, and the user faces difficulty in judging which answer is correct. In this project we build a new prototype for the recently published problem called Veracity - conformity to the truth - which helps to find the true facts from a large amount of conflicting information provided by different websites. For this we design a general framework for the algorithm called Truth Finder, which exploits the relationship between websites and their information: a website is trustworthy if it provides many pieces of true information, and a piece of information is likely to be true if it is provided by many trustworthy websites. In this regard our prototype achieves better results in finding true facts among conflicting information than the popular existing search engines.

ACKNOWLEDGEMENTS

I would like to express my deep gratitude towards all who have supported me in the endeavors that have led to the successful completion of this project. I extend special thanks to the authors Xiaoxin Yin, Jiawei Han, and Philip S. Yu, whose IEEE paper this project implements. I express my sincere thanks to my project guide, Mr. Mohammed Kaleem, for his valuable ideas, support, encouragement, supervision, and useful suggestions throughout this project work. My heart fills with a deep sense of joy as I sincerely thank my supervisor, Prof. Ramsesha Mudigere, for his constant support. I am also highly thankful to Dr. Shalini R. Urs for her valuable support and encouragement, which kept me motivated and helped maneuver my focus in the right direction. I appreciate in all earnestness the support extended to me by Prof. Mandar R. Mutalikdesai for his wholehearted assistance in debugging the entire project flow and formatting the report. I also take this opportunity to thank the staff of the International School of Information Management (ISiM) for their wholehearted support and encouragement. Last but not least, I am grateful to my classmates Abhishek Tripathi and Bukangwa Irene for assisting me in formatting the report.

TABLE OF CONTENTS

BONAFIDE CERTIFICATE
ABSTRACT
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF ABBREVIATIONS
CHAPTER 1. INTRODUCTION
    1.1 ORGANIZATION PROFILE
    1.2 ABOUT TEAM
CHAPTER 2. LITERATURE SURVEY
    2.1 DATA QUALITY
    2.2 WEB MINING
    2.3 LINK ANALYSIS
    2.4 FEASIBILITY STUDY
CHAPTER 3. SYSTEM ANALYSIS
    3.1 PAGE RANK
    3.2 AUTHORITY HUB
    3.3 EXISTING SYSTEM
        3.3.1 DISADVANTAGE
    3.4 PROPOSED SYSTEM
        3.4.1 ADVANTAGE
    3.5 LIMITATIONS
CHAPTER 4. PROBLEM FORMULATION
    4.1 PROBLEM DEFINITION
    4.2 OBJECTIVES
    4.3 HARDWARE SPECIFICATIONS
    4.4 SOFTWARE SPECIFICATIONS
    4.5 SOFTWARE DESCRIPTION
        4.5.1 BRIEF DESCRIPTION OF J2EE
        4.5.2 J2EE COMPONENTS
        4.5.3 J2EE SERVER COMMUNICATIONS
        4.5.4 WEB COMPONENTS
        4.5.5 BUSINESS COMPONENTS
        4.5.6 DATABASE ACCESS
        4.5.7 JAVA SERVER PAGES TECHNOLOGY 1.2
        4.5.8 INTRODUCTION TO JSP
    4.6 JAVA DATABASE CONNECTIVITY (JDBC)
        4.6.1 JDBC AND ODBC IN JAVA
        4.6.2 STRUCTURED QUERY LANGUAGE (SQL)
CHAPTER 5. SYSTEM DESIGN
    5.1 MODULE DESCRIPTION
        5.1.1 DESIGNING THE WEBPAGES
        5.1.2 COLLECTION OF DATA
        5.1.3 DATA SEARCH
        5.1.4 ONCLICK SEARCH
        5.1.5 TRUTHFINDER ALGORITHM
        5.1.6 RESULT CALCULATION
    5.2 DEFINITIONS OF UML DIAGRAMS
        5.2.1 USE CASE DIAGRAM
        5.2.2 STATE DIAGRAM
        5.2.3 ACTIVITY DIAGRAM
        5.2.4 CLASS DIAGRAM
        5.2.5 SEQUENCE DIAGRAM
        5.2.6 DATA FLOW DIAGRAM
    5.3 UML DIAGRAM
        5.3.1 USE CASE DIAGRAM 1 - QUERY SEARCH
        5.3.2 USE CASE DIAGRAM 2 - QUERY SEARCH
        5.3.3 STATE DIAGRAM - TRUTH FINDER
        5.3.4 ACTIVITY DIAGRAM - TRUTH FINDER
        5.3.5 CLASS DIAGRAM
        5.3.6 SEQUENCE DIAGRAM - TRUTH FINDER
        5.3.7 DATA FLOW DIAGRAM - TRUSTWORTHINESS
    5.4 SYSTEM ARCHITECTURE
    5.5 DESIGNING THE WEBPAGES
    5.6 DATA SEARCH
    5.7 TRUTH FINDER ALGORITHM
    5.8 RESULT CALCULATION
CHAPTER 6. SYSTEM TESTING
    6.1 SOFTWARE TESTING
    6.2 UNIT TESTING
CHAPTER 7. SYSTEM IMPLEMENTATION
    7.1 RELATED WORK
    7.2 ALGORITHM USED
    7.3 ITERATIVE COMPUTATION
CHAPTER 8. CONCLUSION AND FUTURE WORK
APPENDIX A
    A.1 SCREENSHOTS
        A.1.1 LOGIN PAGE
        A.1.2 HOME PAGE
        A.1.3 OBJECT LISTS
        A.1.4 NORMAL SEARCH
        A.1.5 ONCLICK SEARCH
        A.1.6 TRUTH FINDER
APPENDIX B
    CHAPTER B1
        B1.1 INTRODUCTION
        B1.2 PROBLEM DEFINITION AND ITS IMPORTANCE
    CHAPTER B2. PAPER 1
        B2.1 INTRODUCTION
        B2.2 PROBLEM DEFINITION
        B2.3 METHODOLOGY USED
        B2.4 DEFINITIONS
            B2.4.1 DEFINITION 1
            B2.4.2 DEFINITION 2
        B2.5 PROBLEM SOLVING APPROACH
        B2.6 COMPUTATIONAL MODEL
            B2.6.1 BASIC INFERENCE
            B2.6.2 INFLUENCES BETWEEN FACTS
            B2.6.3 MANAGING OTHER COMPLEXITY
        B2.7 TRUTHFINDER ALGORITHM
        B2.8 ITERATIVE COMPUTATION
    CHAPTER B3. PAPER 2
        B3.1 ABSTRACT
        B3.2 INTRODUCTION
        B3.3 SURVEY PERFORMED ON TRUTH INFORMATION
            B3.3.1 BRIEF DESCRIPTION OF WEBSITE INFORMATION FILTER SYSTEM
        B3.4 PROBLEM DOMAIN
            B3.4.1 CONFLICTING FACTS ANALYSIS
    CHAPTER B4. COMPARATIVE ANALYSIS
        B4.1 MEASURING TRUSTWORTHINESS
        B4.2 HOW ARE THESE TWO (PAPER 1 AND PAPER 2) DIFFERENT
        B4.3 ADVANTAGES AND LIMITATIONS OF PAPER 1
            B4.3.1 ADVANTAGES
            B4.3.2 LIMITATIONS
        B4.4 ADVANTAGES AND LIMITATIONS OF PAPER 2
            B4.4.1 ADVANTAGES
            B4.4.2 LIMITATIONS
    CHAPTER B5. CONCLUSION AND FUTURE WORK
        B5.1 CONCLUSION
        B5.2 FUTURE WORK
REFERENCES

LIST OF FIGURES

LIST OF ABBREVIATIONS

API     : Application Programming Interface
TF      : Truth Finder
PR      : Page Rank
HITS    : Hyperlink-Induced Topic Search
PAPER 1 : Truth Discovery with Multiple Conflicting Information Providers on the Web
PAPER 2 : Analysis of Conflicting Information and a Study on Truth Finder Algorithm


CHAPTER 1

INTRODUCTION

In today's lifestyle the World Wide Web has become an essential part of our lives and is perhaps the most important information source for most people. People retrieve all kinds of data from it every day. For example, when shopping online, people find product features and specifications on websites like eBay and Flipkart.com. When looking for a particular electronic device such as a camera, they get information and read reviews on websites such as reviews.cnet.com/digital-cameras. When they want still more information about a particular product, they turn to Ask.com or Google. But can we accept the information we get as correct? Is the World Wide Web always trustworthy? Unfortunately, the answer is no. There is no guarantee that the information is correct. Worse, different websites often provide unrelated and conflicting information, as the following examples show.

1. Authors of books. When we tried to find the authors of the book Introduction to Information Security, we found many links providing different author information; only from the cover image, the physical copy, and the publisher's details could we work out the correct authors.

2. Height of Mount Everest. If you want to find out how high Mount Everest is and search Google or Ask.com for "Height of Mount Everest?", you will get a list of conflicting answers. Among the top 10 results, four websites, including Ask.com, give 29,035 feet; five give 29,028 feet; and one gives 29,017 feet. So here you end up with conflicting information.

Today's Internet users have realized this trustworthiness problem of the Web. A survey conducted by Princeton Survey Research in 2005 [12] states that 54% of Internet users trust news and media websites most of the time, 26% trust online shopping websites, and only about 12% trust blogs. There have been tremendous studies on ranking webpages according to authority or popularity based on hyperlinks. The most accepted and sustained methods are Authority-Hub analysis and PageRank, which was adopted by www.google.com. But does authority lead to accuracy? The answer is clearly no. Top-ranked websites are normally the most popular ones, but that does not mean they provide accurate information. For example, a top-ranked bookstore like Barnes and Noble has many errors in its book author information, whereas a smaller bookstore like A1 Books provides more accurate information.

In this project, we address a newly formulated problem called Veracity, that is, conformity to the truth. It can be defined as follows: given a large amount of conflicting information provided by many websites, how do we discover the true facts about each search query or object? Here we use the word "fact" to denote something that is claimed as a fact for an object by some website; such a claim can be true or false. In this project, we only use facts that are either relationships between two objects (e.g., the artist of a song) or properties of objects (e.g., the color or style of a camera). We assume that the facts have already been parsed out of webpages.

Here we make the assumption that a website is trustworthy if the facts it provides are true. Because of this interdependency between facts and websites, we choose an iterative computational method [1]. In each iteration, the probabilities of facts being true and the trustworthiness of websites are recomputed from each other's values in the previous iteration. This procedure differs fundamentally from Authority-Hub analysis: we cannot compute the trustworthiness of a website by simply summing up the weights of the facts it provides. First, the computation is probabilistic rather than additive; second, and most importantly, different facts influence each other. For example, if one website says that the artist of a song is Bruno M and another website says Bruno Mars, they actually support each other even though they provide slightly different facts. We therefore incorporate such influences between facts into the computational model.
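To illustrate how such an influence between two facts might be scored, here is a minimal Java sketch that measures the mutual support of two string-valued facts by normalized edit distance. This is only an illustrative stand-in: the actual implication function in [1] is domain-specific, and the class and method names here are our own.

public class FactImplication {

    // Classic dynamic-programming Levenshtein edit distance.
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return d[a.length()][b.length()];
    }

    // Similarity in [0, 1]; identical strings score 1.0.
    static double similarity(String f1, String f2) {
        int maxLen = Math.max(f1.length(), f2.length());
        return maxLen == 0 ? 1.0 : 1.0 - (double) editDistance(f1, f2) / maxLen;
    }
}

With this measure, similarity("Bruno M", "Bruno Mars") is 0.7, so the two facts partially support each other, while two unrelated artist names score near zero.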

Our preliminary experiments show that TRUTHFINDER achieves high accuracy in discovering true facts, in comparison with Normal Search and Onclick Search, and that it can select trustworthy websites better than authority-based search engines.

1.1 ORGANIZATION PROFILE

For over a decade, Mindset, a subsidiary of Spiro Technologies & Consultants Pvt. Ltd., has provided a wide range of R&D project development training. Our uniqueness lies in exclusive R&D project development. Accordingly, we have created a setting that is enabling, dynamic, and inspiring for growing solutions to global problems through R&D project development. We develop appropriate, responsible, innovative, and practical solutions for our valued clients.

1.2. ABOUT TEAM

Our team consists of more than 300 enthusiastic experts, drawn from a range of disciplines and experience, supported by world-class, state-of-the-art infrastructure and facilities. The strength of the organization lies not only in identifying and articulating intellectual challenges across many disciplines of knowledge, but also in mounting research, training, and demonstration projects that lead to the development of specific, problem-based advanced technologies. The organization's growth has been evolutionary, driven by a vision of the future and grounded in the challenges confronting us today. The organization continues to grow in the size, spread, and intensity of the work it undertakes. Our experts deliver a wide range of R&D project development training to students wishing to undertake professional development, or simply wanting to learn about a new subject or area of study.


CHAPTER 2

LITERATURE SURVEY

2.1 DATA QUALITY

Data quality refers to the fitness of data for use. Data are of high quality "if they are fit for their intended uses in operations, decision making and planning" (J. M. Juran) [15]. Alternatively, data are deemed of high quality if they correctly represent the real-world construct to which they refer [15]. These two views can often be in disagreement, even about the same set of data used for the same purpose.

Before the era of inexpensive servers, massive mainframe computers were used to maintain customer data such as names and addresses, ensuring that mail could be properly routed to the correct destinations. Business rules running on the mainframe corrected common spelling mistakes and small typographical errors in names and addresses. The power of these mainframes is evident here: they could track customers who had changed address or moved away, passed away, been arrested, married, divorced, or experienced other life-changing events. Government agencies began to make their data available to a few service companies, so that customer information could be cross-verified against the National Change of Address registry (NCOA). At the time this technology saved companies millions of dollars compared with manually correcting data; large companies saved on postage because bills and marketing material were delivered to the correct customers' doors. Having started out as a service, data quality then entered the corporations themselves as low-priced, powerful server technology became available in the market.

Name and address data have a well-defined standard, as set by the local postal authorities, whereas other types of data have few recognized standards. There is now a movement in industry to standardize non-address data as well; GS1, a nonprofit organization, is one of the groups supporting this movement.

For companies with sound research efforts, data quality work can include developing protocols for research methods, reducing measurement error, bounds checking of the data, cross tabulation, modeling and outlier detection, verifying data integrity, and so on.

2.2 WEB MINING

Web mining is the application of data mining techniques to discover patterns from the Web. The Internet is clearly the world's largest data repository, and its size is increasing tremendously day by day, hour by hour, and minute by minute. We can therefore conclude that there is greater potential to collect data through the Web than by any other means, because the number of individuals connecting to the Internet keeps adding up. This technology is used to improve the intelligence of search engines and web marketing.

Web usage mining is software that applies data mining to analyze and discover interesting patterns in users' usage data on the Web. This usage data records the behavior of users as they browse and transact on the Web, and the activity involves the automatic discovery of patterns from one or more web servers. Many organizations collect large amounts of data that are automatically generated by web servers and stored in server logs. Analyzing this kind of data helps an organization understand its customers, adapt its business strategies across many products, and evaluate the effectiveness of promotional strategies.

Early web analysis tools simply provided a mechanism to fetch user activity as recorded in the servers. Using this information it is easy to determine the number of accesses to the server, the times and time intervals of visits to a particular URL, and the domain names recorded in the logs. However, the information provided by such tools is no longer sufficient on its own, so more sophisticated techniques have been developed. These fall under two headings: pattern discovery tools and pattern analysis tools.

Yet another interesting application of web usage mining is web link recommendation. This is one of the latest trends: pages are monitored online, and personalized pages are rendered on the basis of similar visit patterns.

2.3 LINK ANALYSIS

Analysis is the separation of an intellectual or material whole into its constituent parts for individual study, together with the study of those parts and their interrelationships in making up the whole (as in, say, a published analysis of poetic meter). Link analysis applies this idea to the Web: it studies the hyperlink structure that connects pages in order to infer relationships such as importance or authority.

2.4. FEASIBILITY STUDY

TECHNICAL REQUIREMENTS
Advanced Java, J2EE
SQL Server 2000
Jakarta Tomcat 5.0.14

ECONOMICAL REQUIREMENTS
The proposed system is cheaper.
It is easily adaptable for both user and developer.


CHAPTER 3

SYSTEM ANALYSIS

3.1 PAGE RANK

PageRank is a link analysis algorithm named after Larry Page [11] and used by the Google search engine. It assigns a numerical weight to each document in a hyperlinked set of documents, such as the World Wide Web, with the purpose of measuring the relative importance of the links. Because many factors other than PR determine ranking, PR alone is a poor indicator of how well a page ranks for particular keywords. The word "Page" has nothing to do with web pages; it comes from the name of the inventor.
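As a rough illustration of the idea, the following Java sketch runs the basic PageRank iteration on a tiny hand-made link graph with a damping factor of 0.85. The graph, the constants, and the class name are illustrative assumptions, not part of the project code.

import java.util.*;

public class PageRankSketch {
    public static void main(String[] args) {
        // page -> list of pages it links to (illustrative three-page graph)
        Map<String, List<String>> links = new HashMap<>();
        links.put("A", Arrays.asList("B", "C"));
        links.put("B", Arrays.asList("C"));
        links.put("C", Arrays.asList("A"));

        double d = 0.85;                       // damping factor
        int n = links.size();
        Map<String, Double> pr = new HashMap<>();
        for (String p : links.keySet()) pr.put(p, 1.0 / n);  // uniform start

        for (int iter = 0; iter < 50; iter++) {
            Map<String, Double> next = new HashMap<>();
            for (String p : links.keySet()) next.put(p, (1 - d) / n);
            // each page distributes its rank evenly over its out-links
            for (Map.Entry<String, List<String>> e : links.entrySet()) {
                double share = pr.get(e.getKey()) / e.getValue().size();
                for (String q : e.getValue())
                    next.put(q, next.get(q) + d * share);
            }
            pr = next;
        }
        System.out.println(pr);  // page C, with two in-links, ends up highest
    }
}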

3.2 AUTHORITY HUB

Authority-Hub analysis, also known as Hyperlink-Induced Topic Search (HITS), is likewise a link analysis algorithm that rates webpages. It was developed by Jon Kleinberg. In our analogy, websites play the role of hubs and facts play the role of authorities. In HITS, the weight of a hub is calculated by summing up the weights of the authorities it links to.
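A minimal sketch of this mutual update, on a tiny illustrative graph: authority scores are summed from the hubs pointing at a page, hub scores from the authorities a page points to, and both vectors are normalized each round. Node indices and the iteration count are assumptions for illustration.

import java.util.*;

public class HitsSketch {
    public static void main(String[] args) {
        // edges: hub page -> authority page (four nodes, numbered 0..3)
        int[][] edges = {{0, 2}, {1, 2}, {0, 3}, {2, 3}};
        int n = 4;
        double[] hub = new double[n], auth = new double[n];
        Arrays.fill(hub, 1.0);
        Arrays.fill(auth, 1.0);

        for (int iter = 0; iter < 20; iter++) {
            double[] newAuth = new double[n], newHub = new double[n];
            for (int[] e : edges) newAuth[e[1]] += hub[e[0]];    // authority update
            for (int[] e : edges) newHub[e[0]] += newAuth[e[1]]; // hub update
            normalize(newAuth);
            normalize(newHub);
            auth = newAuth;
            hub = newHub;
        }
        System.out.println("hubs=" + Arrays.toString(hub)
                + " authorities=" + Arrays.toString(auth));
    }

    // Scale a vector to unit length so the scores stay bounded.
    static void normalize(double[] v) {
        double s = 0;
        for (double x : v) s += x * x;
        s = Math.sqrt(s);
        for (int i = 0; i < v.length; i++) v[i] /= s;
    }
}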


3.3. EXISTING SYSTEM

PageRank and Authority-Hub analysis utilize hyperlinks to find pages with high authority.

These two approaches identify important web pages that users are interested in; unfortunately, the popularity of web pages does not necessarily lead to accuracy of information.

3.3.1. DISADVANTAGE

The popularity of web pages does not necessarily lead to accuracy of information.

Even the most popular website may contain many errors, whereas some comparatively not-so-popular websites may provide more accurate information.


3.4. PROPOSED SYSTEM

First, we take up the existing Veracity problem: how to discover true facts from conflicting information.

Second, we propose a framework to solve this problem, by defining the trustworthiness of websites, confidence of facts, and influences between facts.

Finally, we designed a framework* for the existing algorithm called TRUTHFINDER, which identifies true facts using iterative methods.

3.4.1. ADVANTAGE

Preliminary investigations indicate that TRUTHFINDER achieves high accuracy in discovering true facts, in comparison with Normal Search and Click Search.

It can select trustworthy websites better than authority-based search engines such as Google.


Note: * The framework was built as a part of team work.

3.5. LIMITATIONS

New users of the application need to familiarize themselves with the application environment, as no help files are included.

The database has to be loaded again by clicking "Upload Database" before searching for a particular object.


CHAPTER 4

PROBLEM FORMULATION

4.1 PROBLEM DEFINITION:

In today's era the World Wide Web has become the most important source of information for most people, yet there is no guarantee that the information you get on the Web is true. In this project we build a new prototype for the recently published problem called Veracity - conformity to the truth - which helps to find the true facts from a large amount of conflicting information provided by different websites. For this we design a general framework for the algorithm called Truth Finder, which achieves better results in finding true facts from conflicting information than the popular existing search engines.

4.2 OBJECTIVES

This project was implemented with the web browser as the application client. The aim of the project is to show that TRUTHFINDER successfully finds true facts among conflicting information and identifies the trustworthiness of websites better than other popular search engines, so the user has no need to search elsewhere.

4.3 HARDWARE SPECIFICATIONS

PROCESSOR        : PENTIUM IV 2.6 GHz
RAM              : 512 MB DDR RAM
MONITOR          : 15" COLOR
HARD DISK        : 20 GB
CD DRIVE         : 52X SPEED
KEYBOARD         : STANDARD 102 KEYS
MOUSE            : 3 BUTTONS

4.4 SOFTWARE SPECIFICATIONS

FRONT END        : JAVA, J2EE (JSP, SERVLET)
TOOL USED        : JCREATOR
SERVER           : JAKARTA-TOMCAT-5.0.14
OPERATING SYSTEM : WINDOWS XP
BACK END         : SQL SERVER 2000


4.5 SOFTWARE DESCRIPTION

In this chapter we discuss the technologies used to develop the project, such as the basic concepts of Java and SQL, drawing on the available notes. The necessary parts, such as JSP and SQL, are described below.

4.5.1 BRIEF DESCRIPTION OF J2EE

The J2EE platform uses a multitier distributed application model. Application logic is divided into components according to function, and the various application components that make up a J2EE application are installed on different machines depending on the tier in the multitier J2EE environment to which the application component belongs. Figure 4.1 shows two multitier J2EE applications divided into the tiers described in the following list. The J2EE application parts shown in Figure 4.1 are presented in J2EE Components.

Client-tier components run on the client machine.
Web-tier components run on the J2EE server.
Business-tier components run on the J2EE server.
EIS-tier software runs on the EIS (Enterprise Information System) server.

Although a J2EE application can consist of the three or four tiers shown in Figure 4.1, J2EE multitier applications are generally considered to be three-tiered applications because they are distributed over three different locations: client machines, the J2EE server machine, and the database or legacy machines at the back end. Three-tiered applications that run in this way extend the standard two-tiered client and server model by placing a multithreaded application server between the client application and back-end storage.

[Figure: two multitier J2EE applications spanning the client tier (application client, dynamic HTML pages), the web tier (JSP pages), the business tier (enterprise beans), and the EIS tier (database), distributed across the client machine, the J2EE server machine, and the database server machine.]

Figure 4.1 Typical Architecture of Multi-Tiered Applications (Source: http://java.sun.com/j2ee/tutorial/1_3-fcs/doc/Overview2.html)

4.5.2 J2EE COMPONENTS

J2EE applications are made up of components. A J2EE component is a self-contained functional software unit that is assembled into a J2EE application with its related classes and files and that communicates with other components. The J2EE specification defines the following J2EE components:

Application clients and applets are components that run on the client.
Java Servlet and JavaServer Pages (JSP) technology components are web components that run on the server.
Enterprise JavaBeans (EJB) components (enterprise beans) are business components that run on the server.

J2EE components are written in the Java programming language and are compiled in the same way as any program in the language. The difference between J2EE components and "standard" Java classes is that J2EE components are assembled into a J2EE application, verified to be well formed and in compliance with the J2EE specification, and deployed to production, where they are run and managed by the J2EE server.

4.5.3 J2EE SERVER COMMUNICATIONS

Figure 4.2 shows the various elements that can make up the client tier. The client communicates with the business tier running on the J2EE server either directly or, as in the case of a client running in a browser, by going through JSP pages or servlets running in the Web tier.

Your J2EE application uses a thin browser-based client or thick application client. In deciding which one to use, you should be aware of the trade-offs between keeping functionality on the client and close to the user (thick client) and off-loading as much functionality as possible to the server (thin client). The more functionality you off-load to the server, the easier it is to distribute, deploy, and manage the application; however, keeping more functionality on the client can make for a better perceived user experience.

[Figure: the client tier, either a web browser with web pages, applets, and optional JavaBeans components, or an application client with optional JavaBeans components, communicating with the web tier and business tier on the J2EE server.]

Figure 4.2 Server Communications (Source: http://java.sun.com/j2ee/tutorial/1_3-fcs/doc/Overview2.html)

4.5.4 WEB COMPONENTS

J2EE Web components can be either servlets or JSP pages. Servlets are Java programming language classes that dynamically process requests and construct responses. JSP pages are text-based documents that execute as servlets but allow a more natural approach to creating static content.

Static HTML pages and applets are bundled with Web components during application assembly, but are not considered Web components by the J2EE specification. Serverside utility classes can also be bundled with Web components and, like HTML pages, are not considered Web components.

The Web tier might include a JavaBeans component to manage the user input and send that input to enterprise beans running in the business tier for processing.

4.5.5 BUSINESS COMPONENTS

Business code, which is logic that solves or meets the needs of a particular business domain such as banking, retail, or finance, is handled by enterprise beans running in the business tier. Figure 4.4 shows how an enterprise bean receives data from client programs, processes it (if necessary), and sends it to the enterprise information system tier for storage. An enterprise bean also retrieves data from storage, processes it (if necessary), and sends it back to the client program.

There are three kinds of enterprise beans: session beans, entity beans, and message-driven beans. A session bean represents a transient conversation with a client. When the client finishes executing, the session bean and its data are gone. In contrast, an entity bean represents persistent data stored in one row of a database table. If the client terminates or if the server shuts down, the underlying services ensure that the entity bean data is saved.

A message-driven bean combines features of a session bean and a Java Message Service ("JMS") message listener, allowing a business component to receive JMS messages asynchronously.

4.5.6 DATABASE ACCESS

The relational database provides persistent storage for application data. A J2EE implementation is not required to support a particular type of database, which means that the database supported by different J2EE products can vary. See the Release Notes included with the J2EE SDK download for a list of the databases currently supported by the reference implementation.

4.5.7 JAVA SERVER PAGES TECHNOLOGY 1.2

Java Server Pages technology lets you put snippets of servlet code directly into a text-based document. A JSP page is a text-based document that contains two types of text: static template data, which can be expressed in any text-based format such as HTML, WML, and XML, and JSP elements, which determine how the page constructs dynamic content.

4.5.8 INTRODUCTION TO JSP

The goal of the JavaServer Pages specification is to simplify the creation and management of dynamic web pages by separating content and presentation. JSPs are basically files that combine HTML and new scripting tags. A JSP therefore looks somewhat like HTML, but it gets translated into a Java servlet the first time it is invoked by a client. The resulting servlet is a combination of the HTML from the JSP file and the embedded dynamic content specified by the new tags.

For example, if c:/projavaserver/chapter11/jsp holds our web application, save the page in a file called simple.jsp and place it in the root of the web application; in other words, save it as

C:/localhost/project folder name/sourcename.jsp

You should then see output similar to the following.

The first time the JSP is loaded by the JSP container (also called the JSP engine), the servlet code necessary to fulfill the JSP tags is automatically generated, compiled, and loaded into the servlet container. From then on, as long as the JSP source for the page is not modified, this compiled servlet processes any browser request for the JSP page. If you modify the source code of the JSP, it is automatically recompiled and reloaded the next time that page is requested.
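For concreteness, a minimal JSP page of the kind described above might look as follows; the file name and contents are illustrative only. Static HTML template text is mixed with JSP scriptlets and expressions, and the container translates the file into a servlet on the first request.

<%-- simple.jsp: static HTML template text mixed with JSP elements --%>
<html>
  <head><title>Simple JSP</title></head>
  <body>
    <%-- a scriptlet: embedded Java executed on every request --%>
    <% String name = request.getParameter("name"); %>
    <p>Welcome, <%= (name == null) ? "guest" : name %>!</p>
    <p>Server time: <%= new java.util.Date() %></p>
  </body>
</html>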

4.6 JAVA DATABASE CONNECTIVITY (JDBC)

4.6.1 JDBC AND ODBC IN JAVA:

The most popular and widely accepted database connectivity standard, Open Database Connectivity (ODBC), is used to access relational databases. It offers the ability to connect to almost all databases on almost all platforms. Java applications can also use ODBC to communicate with a database. Why, then, do we need JDBC? There are several reasons:

The ODBC API is written entirely in the C language and makes extensive use of pointers. Calls from Java to native C code have a number of drawbacks in the security, implementation, robustness, and automatic portability of applications.

ODBC is hard to learn. It mixes simple and advanced features together, and it has complex options even for simple queries.

ODBC drivers must be installed on the client's machine.

4.6.2 STRUCTURED QUERY LANGUAGE (SQL)

SQL (pronounced "sequel") is the programming language that defines and manipulates the database. SQL databases are relational databases, which simply means that the data is stored in a set of simple relations. A database can have one or more tables. You define and manipulate data in a table with SQL commands, using the data definition language (DDL) commands to create and alter databases and tables.

You can update, delete, or retrieve data in a table with data manipulation language (DML) commands, which include commands to alter and fetch data.

The most common SQL command is the SELECT command, which allows you to retrieve data from the database.

In addition to SQL commands, the Oracle server has a procedural language called PL/SQL. PL/SQL enables the programmer to program SQL statements: it allows you to control the flow of a SQL program, to use variables, and to write error-handling procedures.
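As a sketch of how the project's data access layer could issue such commands through JDBC, the snippet below connects to the database and runs a SELECT. The data source name, credentials, and the facts table with its columns are illustrative assumptions, not the project's actual schema.

import java.sql.*;

public class JdbcSketch {
    public static void main(String[] args) throws Exception {
        // Connect through a JDBC URL; an ODBC data source name is assumed here.
        String url = "jdbc:odbc:truthfinder";
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             Statement st = con.createStatement();
             // DML: retrieve all facts stored for one object
             ResultSet rs = st.executeQuery(
                     "SELECT website, fact FROM facts WHERE object = 'Mount Everest'")) {
            while (rs.next()) {
                System.out.println(rs.getString("website") + " -> " + rs.getString("fact"));
            }
        }
    }
}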


CHAPTER 5

SYSTEM DESIGN

5.1 MODULE DESCRIPTION

5.1.1 DESIGNING THE WEBPAGES

In this module we design all the webpages with the help of web components such as JavaServer Pages and Hypertext Markup Language.

5.1.2 COLLECTION OF DATA

First, we collect the specific data about an object and store it in the related database: we create a table for the specific object and store the facts about that object in it.

5.1.3 DATA SEARCH

The system searches for the related data links according to the user's input. In this module the user retrieves the specific data about an object.

5.1.4 ONCLICK SEARCH

We design a general framework* for Onclick search alongside Normal search. It works based on the number of clicks made on a particular hyperlink. Initially every link has a count of zero, so the first time it works the same as the normal search. Once the user clicks on a hyperlink, the count for that hyperlink is incremented in the database, and in the next search the clicked hyperlink is displayed first in the list of hyperlinks on the result page.
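A minimal JDBC sketch of this click-count mechanism, assuming a hypothetical links table with url, title, object, and clicks columns (the table and method names are our own illustration, not the project's actual code):

import java.sql.*;

public class OnclickSearch {
    // Called when the user clicks a result link: bump its counter.
    static void recordClick(Connection con, String url) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                "UPDATE links SET clicks = clicks + 1 WHERE url = ?")) {
            ps.setString(1, url);
            ps.executeUpdate();
        }
    }

    // Results for a query, most-clicked links first; with all counts at
    // zero this degenerates to the normal search order.
    static ResultSet search(Connection con, String query) throws SQLException {
        PreparedStatement ps = con.prepareStatement(
                "SELECT url, title FROM links WHERE object = ? ORDER BY clicks DESC");
        ps.setString(1, query);
        return ps.executeQuery();  // caller closes via rs.getStatement().close()
    }
}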

5.1.5 TRUTHFINDER ALGORITHM

We design a general framework* for the existing algorithm called Truth Finder, which exploits the relationship between websites and their information: a website is trustworthy if it provides many pieces of true information, and a piece of information is likely to be true if it is provided by many trustworthy websites.

Note: * The framework was built as a part of team work.

5.1.6 RESULT CALCULATION

For each response to the query we calculate the performance. Using the calculated count, we find the best link and show it as the output. Finally, we calculate the trustworthiness of each website.

5.2. DEFINITIONS OF UML DIAGRAMS

5.2.1 USE CASE DIAGRAM

It is a methodology used in system analysis to identify, clarify, and organize system requirements. A use case diagram is made up of the skeleton of possible interactions between users and the system for a particular model.

5.2.2 STATE DIAGRAM

It is a methodology used to illustrate the states an object can attain, as well as the transitions between them, in the Unified Modeling Language (UML). It is also called a state machine diagram.

5.2.3 ACTIVITY DIAGRAM

It is a methodology used to describe the flow of control of the target system, by defining the complex business rules and operations.

5.2.4 CLASS DIAGRAM

It is a methodology which provides an overview of a system by showing its classes, interfaces, collaborations and relationships among them. It is useful in all types of Object Oriented Programming (OOP).

5.2.5 SEQUENCE DIAGRAM

It is a methodology that maps the scenarios described by the use case diagram into a step-by-step detailed procedure, showing how objects collaborate to achieve the intended system goal.

5.2.6 DATA FLOW DIAGRAM

It is a methodology that represents the flow of data through the business functions and their representatives. It also explains how data is processed and transferred in a system. The graphical description shows how each source of data interacts with the other sources to reach a common goal.

5.3 UML DIAGRAM

5.3.1 USE CASE DIAGRAM 1 - QUERY SEARCH

[Use case diagram: User - Login - Search query - Database - Logout]

Figure 5.1 Use Case Diagram 1 - Query Search

The figure depicts that each user has to log in with correct credentials to enter the search page. After that, he can search for a query, which loads the database and ends with a result for the asked query.

5.3.2 USE CASE DIAGRAM 2 - QUERY SEARCH

[Use case diagram: Truthfinder - User - Registration - Login - Load Database - Normal search - On-Click search - Result Page]

Figure 5.2 Use Case Diagram 2 - Query Search

The figure depicts that each user has to log in with correct credentials; if the user is not registered, he can register through the registration page before entering the search page. After that, he can search for a query with one of the search options - Normal, Onclick, or Truth Finder search - and end with a result for the asked query.


5.3.3 STATE DIAGRAM - TRUTH FINDER

[State diagram: Home, Login, Query Selection, Query processing (database), Conflicting information, Truth Finder Algorithm, Result/Ranking]

Figure 5.3 State Diagram - Truth Finder

The figure depicts the state of each case: after login the user reaches the query selection; after query processing and the truth finder algorithm run, the user obtains the result in the final state.

5.3.4 ACTIVITY DIAGRAM - TRUTH FINDER

[Activity diagram: Home, Login, Validation (invalid returns to Login), Query Selection, Database, Conflicting Information, Truthfinder Algorithm, Result/Rank]

Figure 5.4 Activity Diagram - Truth Finder

This diagram depicts the activity made in each transition to the next, so that we can clearly identify the transitions after each state in our system.

5.3.5 CLASS DIAGRAM

[Class diagram:
Registration: user name, password, confirm password, email, mobile no, address, state, city, country; Registration(), Connect DB(), User reg()
Login: user name, password; Login(), Connect DB(), User reg()
Search Phase: Normal search(), Page ranking search(), Truth search(), Load database()
Normal Search: Navigate URL(), Get pages(), Display pages()
Page Ranking Search: Navigate URL(), Get ranking search(), Display pages()
Truth Find Search: Get truth pages(), Navigate pages(), Display pages()]

Figure 5.5 Class Diagram

It depicts the fields of each class and the actions taking place in each class. These are the classes used in our system.

5.3.6 SEQUENCE DIAGRAM - TRUTH FINDER

[Sequence diagram lifelines: Index, Registration, Login, Search page, Result, Logout; messages: user information, successful registration, username & password, search phase, normal search, onclick search, truth find search, logout]

Figure 5.6 Sequence Diagram - Truth Finder

After a successful login, the user is taken to the search page. On the search page the user can select one of the three search options - Normal, Onclick, or Truth Find search - to get results. In the end, the user can log out after searching.

5.3.7 DATA FLOW DIAGRAM - TRUSTWORTHINESS

[Data flow diagram: websites w1-w4 provide facts f1-f4 about objects o1 and o2]

Figure 5.7 Data Flow Diagram - Trustworthiness (Source: Truth discovery with multiple conflicting information providers on the web [1])


Initially, we consider all websites (w1 through w4) equally trustworthy. The facts about objects o1 and o2 are f1, f2, f3, and f4. W1 provides f1 as well as f3, w2 provides f2, and w3 and w4 each provide f3. So for object o2, the fact with the highest confidence is f3, since it has the most providers.
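The structure in Figure 5.7 can be written down directly; the sketch below, with illustrative names of our own, counts how many websites support each fact, which is the raw signal the iterative computation starts from.

import java.util.*;

public class TrustGraph {
    public static void main(String[] args) {
        // website -> facts it provides, as in Figure 5.7
        Map<String, List<String>> provides = new HashMap<>();
        provides.put("w1", Arrays.asList("f1", "f3"));
        provides.put("w2", Arrays.asList("f2"));
        provides.put("w3", Arrays.asList("f3"));
        provides.put("w4", Arrays.asList("f3"));

        // fact -> the object it is about (f4 has no provider listed above)
        Map<String, String> objectOf = new HashMap<>();
        objectOf.put("f1", "o1");
        objectOf.put("f2", "o1");
        objectOf.put("f3", "o2");
        objectOf.put("f4", "o2");

        // Count supporting websites per fact: f3 is provided by three
        // websites, so it starts with the highest support for o2.
        Map<String, Integer> support = new HashMap<>();
        for (List<String> facts : provides.values())
            for (String f : facts)
                support.merge(f, 1, Integer::sum);
        System.out.println(support);  // {f1=1, f2=1, f3=3}
    }
}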

5.4 SYSTEM ARCHITECTURE

[System architecture: Home, Truth Discovery with Multiple Conflicting Information Providers on the Web; Login, Validation (correct/incorrect), Query page, search type (Normal Search, On Click, Truth Finder), Conflicting Information, Hit-listed Page, Truth Finder Information, Result]

Figure 5.8 System Architecture

The figure depicts the entire architecture of the system. Starting from the login page, once the user is validated he enters the query page. On the query page he can select the type of search and proceed with the query to get the corresponding results according to the query and search type.

5.5 DESIGNING THE WEBPAGES

[Site map: the Index page links to the Login, New User, About Us, Contact Us, Products, and Services pages]

Figure 5.9 Designing the Webpages

This figure shows all the webpages of the front screen and how they are linked from one page to another.

5.6 DATA SEARCH

[Data flow: Search Page, search word, Database Server, retrieve search word, display specific information]

Figure 5.10 Data Search

This figure depicts how search happens in the system. Starting from the search page, after the user selects an object the system looks in the database server, retrieves the information for the queried object, and finally displays the specific information.

5.7 TRUTH FINDER ALGORITHM

[Flow: Search, Conflicting Information, Truth Finder Algorithm, Iterative Computation, Calculate Support, Truth Finder Pages]

Figure 5.11 Truth Finder Algorithm

This figure depicts how the truth finder algorithm works in the system. Starting from the conflicting information about an object, the information is fed into the TF algorithm; an iteration runs over each piece of content to compare and calculate the confidence score and accuracy, and finally the true information about the queried object is displayed.

5.8 RESULT CALCULATION

[Flow: calculate confidence and fact, database, result]

Figure 5.12 Result Calculation

This figure shows the result calculation: the fact confidence and accuracy are calculated and stored in the database, and finally the result is obtained.


CHAPTER 6

SYSTEM TESTING

6.1 SOFTWARE TESTING:

The purpose of testing is to discover errors. Testing is the process of trying to discover every conceivable fault or weakness in a work product. It provides a way to check the functionality of components, sub-assemblies, assemblies, and/or a finished product. It is the process of exercising software with the intent of ensuring that the software system meets its requirements and user expectations and does not fail in an unacceptable manner. There are various types of test, and each test type addresses a specific testing requirement.

For our project we used manual testing, because the content and context in our project change dynamically over time; hence it is not possible to apply automation testing to our system.

6.2 UNIT TESTING

Unit testing involves the design of test cases that validate that the internal program logic is functioning properly and that program inputs produce valid outputs. All decision branches and internal code flow should be validated. It is the testing of individual software units of the application, done after the completion of each unit and before integration. This is structural testing that relies on knowledge of the unit's construction and is invasive. Unit tests perform basic tests at the component level and test a specific business process, application, and/or system configuration. Unit tests ensure that each unique path of a business process performs accurately to the documented specifications and contains clearly defined inputs and expected results.


CHAPTER 7

SYSTEM IMPLEMENTATION

7.1 RELATED WORK

The quality of information on the Web has always been a major concern for Internet users. There have been studies on which factors of data quality are important for users, and on machine learning approaches for distinguishing high-quality from low-quality web pages, where quality is defined by human preference. It has also been shown that information quality measures can help improve the effectiveness of Web search. In 1998, two pieces of groundbreaking work, PageRank and Authority-Hub analysis, were proposed to utilize hyperlinks to find pages with high authority. These two approaches are very successful at identifying important web pages that users are interested in, as also shown by a subsequent study in which the authors [1] propose a framework of link analysis and provide theoretical studies of many link-based approaches. Unfortunately, the popularity of web pages does not necessarily lead to accuracy of information. Two observations are made in the authors' [1] experiments: 1) even the most popular websites provide many errors about many objects, whereas websites that are not so popular may provide correct information; and 2) instead of depending on a single website, we can use several different websites together to obtain more accurate information.

TRUTHFINDER first studies the interaction between websites and their facts, and based on this it infers website trustworthiness and fact confidence from each other. A comparison can be made between Authority-Hub analysis and TRUTHFINDER: websites play the role of hubs and facts play the role of authorities. In contrast, however, Authority-Hub analysis is fundamentally different and cannot be applied directly to this problem. In Authority-Hub analysis, a hub's weight is computed by adding up the weights of the authorities linked to it. In TRUTHFINDER this is not true: the trustworthiness of a website is computed from the accuracy of the facts it provides about objects, and, importantly, the confidence of a fact is not simply the sum of the trustworthiness of its websites; it must be computed with a nonlinear transformation grounded in probability theory. The authors also discuss an approach that uses the trust or distrust relationships between some users (e.g., user ratings on eBay.com) to determine the trust relationship between each pair of users.
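To make this nonlinear combination concrete, the core update rules of TRUTHFINDER as given in [1] can be written as follows, using the notation of Table 7.1 below (with F(w) denoting the set of facts provided by website w):

\[ t(w) = \frac{\sum_{f \in F(w)} s(f)}{|F(w)|}, \qquad \tau(w) = -\ln\bigl(1 - t(w)\bigr), \]

\[ \sigma(f) = \sum_{w \in W(f)} \tau(w), \qquad s(f) = 1 - e^{-\sigma(f)} = 1 - \prod_{w \in W(f)} \bigl(1 - t(w)\bigr), \]

so a website's trustworthiness is the average confidence of its facts, while a fact's confidence is the probability that not all of its providers are wrong. Influences between facts enter through the adjusted confidence score and a dampened logistic form:

\[ \sigma^{*}(f) = \sigma(f) + \rho \sum_{o(f') = o(f)} \sigma(f')\,\mathrm{imp}(f' \to f), \qquad s^{*}(f) = \frac{1}{1 + e^{-\gamma \sigma^{*}(f)}}. \]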

TRUTHFINDER uses iterative methods to compute website trustworthiness and fact confidence, a technique widely used in many link analysis approaches. The common feature of these approaches is that they start from some initial state that is either random or uninformative. Then, in each iteration, the approach improves the current state by propagating information (weights, probabilities, trustworthiness, etc.) through the links. This iterative procedure has proven successful in many applications, and thus we adopt it in TRUTHFINDER.


7.2. ALGORITHM USED

Algorithm: TRUTHFINDER [1]

Input: the set of websites W, the set of facts F, and the links between them.
Output: website trustworthiness and fact confidence.

Calculate matrices A and B, and for each website w ∈ W proceed as follows:
Step 1: Initially, consider each website equally trustworthy.
Step 2: Compute the trustworthiness score of each website (veracity problem exploration).
Step 3: Calculate matrix B, which infers website trustworthiness from fact confidence.
Step 4: Compute the confidence of each fact f from its adjusted confidence score.
Step 5: Store the trustworthiness values of the current iteration for comparison in the next.
Step 6: Calculate matrix A, which infers fact confidence from website trustworthiness.
Step 7: Repeat from Step 3 until the trustworthiness scores of the last iteration and the current iteration are sufficiently similar.

Source: "Truth Discovery with Multiple Conflicting Information Providers on the Web" [1]


Table 7.1 TRUTHFINDER Algorithm Notations

Name          Description
M             Number of websites
N             Number of facts
w             A website
t(w)          The trustworthiness of w
τ(w)          The trustworthiness score of w
f             A fact
s(f)          The confidence of f
σ(f)          The confidence score of f
σ*(f)         The adjusted confidence score of f
W(f)          The set of websites providing f
o(f)          The object that f is about
imp(fj → fk)  Implication from fact fj to fact fk
ρ             Weight of related facts about the same object
γ             Dampening factor
δ             Max difference between two iterations

Source: "Truth Discovery with Multiple Conflicting Information Providers on the Web" [1]

7.3 ITERATIVE COMPUTATION

As we know, we can infer fact confidence if we know website trustworthiness, and infer website trustworthiness if we know fact confidence. As in Authority-Hub analysis and PageRank, TRUTHFINDER adopts an iterative method to compute the trustworthiness of websites and the confidence of facts. Initially, it has very little information about the websites and facts. In every iteration, TRUTHFINDER tries to improve its knowledge about their trustworthiness and confidence, and it stops when the computation reaches a stable state. As in other iterative approaches, TRUTHFINDER needs an initial state. We choose the initial state in which all websites have uniform trustworthiness t0; t0 should be set to the estimated average trustworthiness, such as 0.9. From the website trustworthiness, TRUTHFINDER can infer the confidence of facts, which is meaningful because facts supported by many websites are more likely to be correct. On the other hand, if we start from a uniform fact confidence, we cannot infer meaningful trustworthiness for websites. Before the iterative computation, we also need to calculate the two matrices A and B, as they are calculated once and used in every iteration. In each step of the iterative procedure, TRUTHFINDER first uses the website trustworthiness to compute the fact confidence and then recomputes the website trustworthiness from the fact confidence. The matrices are stored in sparse formats, and the computational cost of multiplying such a matrix by a vector is linear in the number of nonzero entries in the matrix. TRUTHFINDER stops iterating when it reaches a stable state. Stability is measured by the extent to which the trustworthiness of websites changes between iterations: if the change is very small, the iteration stops.
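To make the procedure concrete, the following is a minimal sketch of this loop in Python with NumPy. The function and variable names are my own, the matrices A and B are assumed to be precomputed as described above (in practice as sparse matrices), and the convergence measure shown is one reasonable choice rather than the system's exact implementation:

```python
import numpy as np

def truthfinder(A, B, t0=0.9, gamma=0.3, delta=0.001, max_iter=100):
    """Iteratively compute website trustworthiness and fact confidence.

    A: N x M matrix inferring fact confidence scores from website
       trustworthiness scores (provider links plus implication weights).
    B: M x N matrix inferring website trustworthiness from fact
       confidence (row w averages the confidence of facts provided by w).
    """
    M = B.shape[0]
    t = np.full(M, t0)                       # uniform initial trustworthiness
    for _ in range(max_iter):
        tau = -np.log(1.0 - t)               # trustworthiness scores, eq. (3)
        sigma_star = A @ tau                 # adjusted confidence scores, eq. (6)
        s = 1.0 / (1.0 + np.exp(-gamma * sigma_star))   # fact confidence, eq. (8)
        t_new = B @ s                        # trustworthiness, eq. (1)
        # Stop when trustworthiness barely changes between iterations;
        # here measured as 1 minus the cosine similarity of the two vectors.
        change = 1.0 - (t @ t_new) / (np.linalg.norm(t) * np.linalg.norm(t_new))
        t = t_new
        if change < delta:
            break
    return t, s
```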


CHAPTER 8

CONCLUSION AND FUTURE WORK

In this project, we addressed the Veracity problem, which aims at resolving conflicting facts from multiple websites and finding the true facts among them. We designed a prototype of the TRUTHFINDER algorithm, an approach that utilizes the interdependency between website trustworthiness and fact confidence to find trustworthy websites and true facts. Preliminary investigations indicate that TRUTHFINDER achieves high accuracy in discovering true facts and, at the same time, identifies websites that provide more accurate information, in comparison with Normal Search and Click Search.

The case of multiple true facts, as well as identical or false facts, will be studied in future work. We can combine PageRank with TRUTHFINDER to achieve still better results. In real time, however, the time taken by such a system to retrieve results for a given query may be comparatively high compared with a normal search, because both methods have their own procedures for retrieving results for a particular query. To overcome this problem, we can adopt PageRank with TRUTHFINDER under certain conditions. The general idea is as follows: Google, for example, returns thousands of links for a given query; from this result set, take only the first batch, say the top 50 results, and apply TRUTHFINDER to it, then continue in the same way with the rest of the result set. A system built this way can achieve better and faster results. It is, however, necessary to define a set of rules for following any such method, so I describe one heuristic that might be useful: apply TRUTHFINDER to the top results of PageRank first, and continue applying it to the rest of the result set, batch by batch, as sketched below. Here we assume the top PageRank results are the most likely to be true in the PageRank sense, and we know that TRUTHFINDER achieves better accuracy; so by combining TRUTHFINDER with PageRank in this way, we can retrieve better, faster results instead of applying TRUTHFINDER to the whole PageRank result set.
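A minimal sketch of this batching heuristic in Python (the function names are hypothetical; analyze stands in for a routine that runs TRUTHFINDER over one batch):

```python
def batched_truthfinder(ranked_results, analyze, batch_size=50):
    """Apply TRUTHFINDER batch by batch over a PageRank-ordered result list:
    the top batch is verified first, then the rest, preserving rank order."""
    verified = []
    for start in range(0, len(ranked_results), batch_size):
        batch = ranked_results[start:start + batch_size]
        verified.extend(analyze(batch))   # run TRUTHFINDER on this batch only
    return verified
```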


APPENDIX A

A.1 SCREENSHOTS

A.1.1 LOGIN PAGE


A.1.2 HOME PAGE

A.1.3 OBJECT LISTS

A.1.4 NORMAL SEARCH

A.1.5 ONCLICK SEARCH

A.1.6 TRUTH FINDER

APPENDIX B

CHAPTER B1

In this appendix, I present a brief summary of the methodologies used by the authors of the research papers listed below.

PAPER 1: Truth Discovery with Multiple Conflicting Information Providers on the Web, by Xiaoxin Yin (UIUC), Jiawei Han (UIUC), and Philip S. Yu (IBM T.J. Watson Research Center), Feb 7, 2007.

PAPER 2: Analysis of Conflicting Information and a Study on Truth Finder Algorithm, by Rathod Maheshwar and Ch. Kishore Kumar, Intrinity Engineering College, India, December 2011.

B1.1 INTRODUCTION

In today's era, the World Wide Web has become the most important source of information for most people, yet there is no guarantee that the information found on the web is true. For example, if we search for the height of Mount Everest on the web, we will end up with different answers from different websites, so the user faces difficulty in judging the correct one. In this project, we* built a prototype for the newly published problem called Veracity (conformity to truth), which helps to find the true facts from a large amount of conflicting information provided by different websites. For this, we* designed a general framework around the TRUTHFINDER algorithm, which exploits the relationship between websites and their information: a website is trustworthy if it provides many pieces of true information, and a piece of information is likely to be true if it is provided by many trustworthy websites. In this regard, our prototype finds true facts from conflicting information better than popular existing search engines.

B1.2 PROBLEM DEFINITION AND ITS IMPORTANCE

In this section, we describe the problem of finding true facts from the conflicting information provided by different websites about a particular subject. Here, a subject comprises a number of facts about a particular object, and an object is what the user's query is about; e.g., the specifications of a camera are the facts, and the camera is the object. The information provided by many websites is usually conflicting for a given object, and the user is left confused about which answer to accept.

To resolve this conflicting information for each queried object, we use the proposed algorithm TRUTHFINDER [1]. The input to TRUTHFINDER is the large

______________________________
Note: * This was part of team work in which the framework was built.

amount of facts about the queried object provided by different websites, and the responsibility of TRUTHFINDER is to filter out the true facts among them. Using this approach, the user can get the true result for the query.


CHAPTER B2

PAPER 1

B2.1 INTRODUCTION

Nowadays, the World Wide Web has become the most important source of information for most of us. Suppose a user searches for a query; the user will get multiple results and might end up accessing a wrong result or unwanted output. Ultimately, there is no guarantee that the information we get is true. To address this, the authors introduced a new problem called Veracity: conformity to the truth. It helps to find true facts in a large amount of conflicting information. They designed a general framework for the veracity problem and invented an algorithm called TRUTHFINDER, which exploits the relationship between websites and their information.

For example, consider the height of Mount Everest. If you want to find out how high Mount Everest is and search in Google or Ask.com for "Height of Mount Everest?", you will get a list of conflicting information. Among the top 10 results, four websites, including Ask.com, give 29,035 feet; five give 29,028 feet; and one gives 29,017 feet. So you end up with conflicting information.

According to the Princeton survey [3] on credibility conducted in 2005, 54 percent of users trust news websites, 26 percent trust online shopping websites, and merely 12 percent trust blogs.

They proposed the new problem of veracity, defined as follows: given a large amount of conflicting information about an object from different websites, discover the true facts about that object. They use the word fact to denote a property of an object that is claimed as a fact by some websites; such a fact can be either true or false. Of course, there are conflicting facts on the web, such as different specifications for the same camera. A fact is likely to be true if it is provided by many trustworthy websites, and a website is trustworthy if most of the facts it provides are true.

[Figure B2.1: a graph linking websites (w1 to w4) to the facts they provide (f1 to f4) and the objects (o1, o2) those facts describe]

Figure B2.1 Input to TRUTHFINDER. Source: "Truth Discovery with Multiple Conflicting Information Providers on the Web" [1]

Initially, we consider every website (w1, w2, etc.) equally trustworthy. The facts about objects o1 and o2 are f1, f2, f3, and f4. Website w1 provides f1 as well as f3, w2 provides f2, w3 provides f3, and w4 provides f3 again. Hence, for object o2, the fact with the highest confidence is f3.

B2.2 PROBLEM DEFINITION

A large number of facts about an object is provided as input to TRUTHFINDER. These facts come from many websites for a queried object and conflict with one another. The goal of TRUTHFINDER is to find the true facts among them.

B2.3 METHODOLOGY USED

Based on the interdependency between facts and websites, the authors chose the method described below.

It is an iterative computational method: in each iteration, the probabilities of facts being true and the trustworthiness of websites are recomputed from each other. This iterative process is quite different from Authority-Hub analysis, for two reasons.

First, the trustworthiness of a website does not depend on how many facts it provides, nor is the probability of a fact being true obtained by simply adding up the trustworthiness of the websites providing it; what matters is the accuracy of the facts.

Second, many facts influence each other. For example, if the artist of a song is given by one fact as "Bryan A." and by another fact as "Bryan Adams", these facts influence each other.

Finally, they proposed a new algorithm called TRUTHFINDER for identifying true facts within a mass of conflicting information using this iterative method.

B2.4 DEFINITIONS

B2.4.1 DEFINITION 1

Confidence of a fact: the confidence of a fact f [1], denoted s(f), is the probability of f being true, according to the best of our knowledge.

B2.4.2 DEFINITION 2

Trustworthiness of a website: the trustworthiness of a website w [1], denoted t(w), is the expected confidence of the facts provided by w.

B2.5 PROBLEM SOLVING APPROACH

For an object, many facts from different websites may conflict. Sometimes, however, facts influence each other even when they are slightly different. For example, one website gives the artist name of a song as "Bryan A." and another website gives it as "Bryan Adams". Even though the information is not identical, the facts still support each other.

To capture this relationship, the authors proposed the notion of implication between facts. The implication from f1 to f2 is denoted imp(f1 → f2); it states that the confidence of f2 should be increased or decreased according to the confidence of f1. The implication value between facts f1 and f2 lies between −1 and 1. A positive value indicates that if f1 is correct, then f2 is likely to be correct; a negative value indicates that if f1 is correct, then f2 is likely to be wrong.

The rules of thumb for the computational model are given below:
1. Usually there is only one true fact for a property of an object.
2. This true fact appears the same, or similar, on different websites.
3. The false facts on one website are less likely to appear in similar form on another website.
4. A website that provides mostly true facts for many objects will likely provide true facts for other objects too.

B2.6 COMPUTATIONAL MODEL

As we saw above, website trustworthiness and fact confidence are determined by each other, so we can use an iterative method to compute both. According to the third rule, true facts are more consistent across websites than false ones.

Study of Website Trustworthiness and Fact Confidence: we first discuss how website trustworthiness and fact confidence are inferred from each other.

Table 1: Variables and Parameters of TRUTHFINDER

Name          Description
M             Number of websites
N             Number of facts
w             A website
t(w)          The trustworthiness of w
τ(w)          The trustworthiness score of w
f             A fact
s(f)          The confidence of f
σ(f)          The confidence score of f
σ*(f)         The adjusted confidence score of f
W(f)          The set of websites providing f
o(f)          The object that f is about
imp(fj → fk)  Implication from fact fj to fact fk
ρ             Weight of related facts about the same object
γ             Dampening factor
δ             Max difference between two iterations

Source: "Truth Discovery with Multiple Conflicting Information Providers on the Web" by Xiaoxin Yin, Jiawei Han, and Philip S. Yu, Feb 7, 2007.

B2.6.1 BASIC INFERENCE

Here, the trustworthiness of a website is simply the expected confidence of the facts it provides. For a website w, we compute the trustworthiness t(w) as the average confidence of the facts provided by w:

t(w) = ( Σ_{f ∈ F(w)} s(f) ) / |F(w)|    (1)

where F(w) is the set of facts provided by w. It is more difficult to find the true fact if a website provides more than one fact for the same object.

[Figure B2.2: websites w1, w2, and w3, with trustworthiness t(w) and trustworthiness scores τ(w), linked to the facts f1 and f2 they provide about object o1, with confidence s(f) and confidence scores σ(f)]

Figure B2.2 Computing Confidence of Facts. Source: "Truth Discovery with Multiple Conflicting Information Providers on the Web" [1]

First consider the simple case in which f1 is the only fact about object o1, and f1 is provided by w1 and w2. If f1 is wrong, then both w1 and w2 are wrong. The probability of both being wrong is (1 − t(w1)) · (1 − t(w2)), so the probability of f1 being correct is 1 − (1 − t(w1)) · (1 − t(w2)). In general, if f is the only fact about an object, its confidence is defined as

s(f) = 1 − Π_{w ∈ W(f)} (1 − t(w))    (2)

where W(f) is the set of websites providing f.

In Equation (2), each factor (1 − t(w)) is usually quite small, and multiplying many of them can lead to underflow. To avoid this problem, and to facilitate computation and the exploration of the veracity problem, we take logarithms and define the trustworthiness score of a website as

τ(w) = −ln(1 − t(w))    (3)

τ(w) ranges from 0 to +∞ and expresses how good w is. For instance, suppose there are two websites w1 and w2 with trustworthiness t(w1) = 0.9 and t(w2) = 0.99. We can see that w2 is substantially better than w1, but t(w2) = 1.1 · t(w1) does not convey this. The trustworthiness score does: τ(w1) = −ln(0.1) ≈ 2.3 and τ(w2) = −ln(0.01) ≈ 4.6, so τ(w2) = 2 · τ(w1), which better reflects the relative accuracy of the websites. Similarly, the confidence score of a fact is defined as

σ(f) = −ln(1 − s(f))    (4)

An important property is that the confidence score of a fact f is exactly the sum of the trustworthiness scores of the websites providing f. This is shown in Lemma 1.

Lemma 1. σ(f) = Σ_{w ∈ W(f)} τ(w)    (5)

Proof. According to Equation (2),

1 − s(f) = Π_{w ∈ W(f)} (1 − t(w)).

Taking the logarithm of both sides gives

ln(1 − s(f)) = Σ_{w ∈ W(f)} ln(1 − t(w)),

and therefore

σ(f) = −ln(1 − s(f)) = Σ_{w ∈ W(f)} −ln(1 − t(w)) = Σ_{w ∈ W(f)} τ(w).
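As a quick numeric sanity check of Lemma 1 (a sketch; the trustworthiness values here are made-up illustrations, not data from the project):

```python
import math

t = [0.9, 0.8, 0.6]          # trustworthiness of the sites providing f

# Confidence via Equation (2): 1 - 0.1 * 0.2 * 0.4 = 0.992
s = 1.0
for tw in t:
    s *= (1.0 - tw)
s = 1.0 - s

# Confidence score two ways: directly via Equation (4),
# and as a sum of trustworthiness scores via Lemma 1 / Equation (5).
sigma_direct = -math.log(1.0 - s)
sigma_sum = sum(-math.log(1.0 - tw) for tw in t)

assert abs(sigma_direct - sigma_sum) < 1e-12   # both equal about 4.828
```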

B2.6.2 INFLUENCES BETWEEN FACTS

In this section we discuss how different facts about the same object influence each other. In Figure B2.2, there are two facts, f1 and f2, about one object. If the implication from f2 to f1 is very high (i.e., they are very similar facts), then if f2 is supported by many trustworthy websites, f1 is in effect also supported by those websites, so f1 should have reasonably high confidence. In other words, increasing the confidence score of f2 should also increase that of f1, in proportion to the sum of the trustworthiness scores of the websites providing f2. The adjusted confidence score is defined as

σ*(f) = σ(f) + ρ · Σ_{o(f′) = o(f), f′ ≠ f} σ(f′) · imp(f′ → f)    (6)

Here, ρ is a parameter between 0 and 1 that controls the influence of related facts: σ*(f) is the sum of the confidence score of f and a fraction of the confidence score of each related fact f′, weighted by the implication between the facts. Note that imp(f′ → f) < 0 when f′ is conflicting with f.

We can compute the confidence of a fact from the adjusted confidence score σ*(f) in the same way as from σ(f) in Equation (4); s*(f) is used to represent this confidence:

s*(f) = 1 − e^{−σ*(f)}    (7)

B2.6.3 MANAGING OTHER COMPLEXITY

One problem with this model is that it treats different websites as independent of each other, whereas in reality content can propagate from one website to another. To compensate for the resulting overconfidence, a dampening factor γ is added to Equation (7), which is redefined as

s*(f) = 1 − e^{−γ · σ*(f)},  where 0 < γ < 1.

Second, the confidence of f can easily become negative when some fact f′ conflicts with the facts of trustworthy websites, which makes σ*(f) < 0 and s*(f) < 0. This is unreasonable even in the presence of such negative evidence, because f may still contain some correct information, so its confidence should always stay above zero. The authors therefore adopt a logistic function in place of Equation (7), defined as

s*(f) = 1 / (1 + e^{−γ · σ*(f)})    (8)

When γ · σ*(f) is significantly greater than zero, Equation (8) is very close to the dampened form of Equation (7), since 1 / (1 + e^{−x}) ≈ 1 − e^{−x} for large x; when γ · σ*(f) is negative, Equation (8) still yields a small positive confidence.
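To make Equations (6) and (8) concrete, here is a minimal sketch in Python for a single fact (the values of ρ, γ, the confidence scores, and the implications are illustrative assumptions, not values from the paper):

```python
import math

rho, gamma = 0.5, 0.3            # influence weight and dampening factor (assumed)

sigma_f = 2.0                    # confidence score of f (assumed)
related = [(1.5, 0.8), (0.7, -0.6)]   # (sigma(f'), imp(f' -> f)) for other facts
                                      # about the same object; the second conflicts

# Adjusted confidence score, Equation (6)
sigma_star = sigma_f + rho * sum(sig * imp for sig, imp in related)

# Final confidence via the logistic function, Equation (8);
# this stays positive even if sigma_star were negative.
s_star = 1.0 / (1.0 + math.exp(-gamma * sigma_star))
```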

B2.7 TRUTHFINDER ALGORITHM

Input: the set of websites W, the set of facts F, and the links between them.
Output: website trustworthiness and fact confidence.

Calculate matrices A and B, and for each website w ∈ W proceed as follows:
Step 1: Initially, consider each website equally trustworthy.
Step 2: Compute the trustworthiness score of each website (veracity problem exploration).
Step 3: Calculate matrix B, which infers website trustworthiness from fact confidence.
Step 4: Compute the confidence of each fact f from its adjusted confidence score.
Step 5: Store the trustworthiness values of the current iteration for comparison in the next.
Step 6: Calculate matrix A, which infers fact confidence from website trustworthiness.
Step 7: Repeat from Step 3 until the trustworthiness scores of the last iteration and the current iteration are sufficiently similar.

B2.8 ITERATIVE COMPUTATION

As discussed earlier, website trustworthiness can be inferred if we know fact confidence, and fact confidence can be inferred if we know website trustworthiness. Hence, TRUTHFINDER adopts an iterative computation method for both. During the initial stage, it has very little information about the websites and facts; at each iteration, TRUTHFINDER improves its knowledge of website trustworthiness and fact confidence, and it stops when it reaches a stable state.

Initially, every website is assigned the same trustworthiness t0 (assumed to be the average trustworthiness, e.g., 0.9). In each iteration, TRUTHFINDER first uses the website trustworthiness to compute the fact confidence and then recomputes the website trustworthiness from the fact confidence. It keeps iterating until it reaches the stable state: when the change in the trustworthiness of the websites between the last iteration and the current iteration is sufficiently small, TRUTHFINDER stops.


CHAPTER B3

PAPER 2

B3.1 ABSTRACT

The World Wide Web has become the most important source of information for most of today's generation, but unfortunately there is no guarantee of the correctness of the information users get: users end up with a bunch of conflicting information for a query. The authors' analysis addresses the problem called veracity, i.e., truth confirmation for the queried object, which concerns how to find true facts in a large amount of conflicting information. They present a general framework that utilizes the TRUTHFINDER algorithm. The algorithm works on the basis of the relationship between websites and their information, defined as follows: a website is trustworthy if it provides mostly true facts, and a fact is likely to be true if it is provided by many trustworthy websites. Their study shows that, using this system, TRUTHFINDER achieves good results compared with most existing search engines.

B3.2 INTRODUCTION

Information quality plays a vital role in information technology, and the quality of information may vary from one task to another: information collected by a user may be useful for one task and insufficient for another, or a piece of information may be useful for both. What is required is to know which quality aspects are relevant and who the consumer of that information is.

Earlier, researchers compiled lists of web attributes for assessing the trustworthiness of websites [2]. For example, evidence commonly used for this problem includes: i) safeguard assurances, ii) the marketer's reputation, iii) ease of navigation, iv) robust order fulfillment, v) the professionalism of the website, and vi) the use of state-of-the-art web page design technology.

The World Wide Web has become the most important source of information for most people. At every moment, people retrieve different kinds of information, and there is no guarantee that the retrieved information is true. For example, if a customer wants to know the reviews of the movie Wrong Turn, they might find different information on different websites, such as imdb.com and rottentomatoes.com, and be left confused about which of the conflicting pieces of information to choose.

Nowadays, web services are considered the new industry standard for distributed computing and a way to achieve universal interoperability. By enabling the exchange and use of information, web services can serve as a communication protocol for integrating business applications effectively and efficiently. Combining them with new, upcoming technologies also surfaces new computational complexities and business challenges.

The main theme of the services computing field is to provide services whenever needed, called service on demand. This requires many preliminary processes, such as selecting a vendor, placing an order, and checking shipment status, each considered a service. Services of this type are deployed to meet customers' changing needs over time.

B3.3 SURVEY PERFORMED ON TRUTH INFORMATION

Now let us examine conformity to truth, which studies how to find the true facts in a large amount of conflicting information about a queried object. The authors use the TRUTHFINDER algorithm, which exploits the relationship between websites and their information: a website is trustworthy if it provides many pieces of true information, and a fact is likely to be true if it is provided by many trustworthy websites.

For retrieving trustworthy information, TRUTHFINDER uses two quantities, website trustworthiness and fact confidence, with some limitations. Initially, the website trustworthiness is taken as 0.9. For some queries, the result is produced from a single object or one of its properties; e.g., calculating a distance in kilometers and the time required to reach a destination may require many internal calculations and may degrade performance with respect to time, so the time taken to fetch the result is high.

Basically, an information provider on the web has its own intentions, and the information targeted at users may vary in quality. Ultimately, it is the user's problem to select the right information from the results.

B3.3.1 BRIEF DESCRIPTION OF THE WEBSITE INFORMATION FILTER SYSTEM

[Figure B3.1: the WebSIFT pipeline; site files, server logs, and user session files feed a preprocessing stage, then mining algorithms, then pattern analysis, producing rules, patterns, and statistics, from which interesting rules, patterns, and statistics are filtered]

The Website Information Filter (WebSIFT) system uses the content and structure information of a website to identify interesting results from web usage mining [4]. Web usage mining requires three types of domain information: usage, content, and structure. The preprocessing algorithms include identifying users' server sessions and inferring cached page references through the use of the referrer field [2]. The structure of the web usage mining system is shown in Figure B3.1 above.


Figure B3.1 Web Usage Mining. Source: "WebSIFT: The Website Information Filter System" [4]

The input to the web usage mining process includes three server logs (access, referrer, and agent), the HTML files that make up the website, and extra data, such as user registration data or agent logs, that can be provided remotely. In the preprocessing stage, usage data is used to construct user session files, with the page topology and structure serving as a reference; from this data, user behavior can be derived. Each user session is converted into a transaction file and fed as input to the next stage, pattern discovery. Both the specifications of pages and the page topology are fed into the information filter. In addition to preprocessing and knowledge discovery, the system uses the structure and the preprocessed content of a website to form beliefs, and the information filter uses these beliefs to extract the interesting result set.

B3.4 PROBLEM DOMAIN

For a queried object, information may be provided by multiple websites, and that information is likely to be conflicting, leaving the user confused about which search results to accept. To overcome this problem, the authors employ the TRUTHFINDER algorithm. It takes a large number of conflicting facts from different websites as input, and the goal of the system is to return the true facts among them.

B3.4.1 CONFLICTING FACTS ANALYSIS

To deal with conflicting facts, the authors address the problem called veracity, that is, conformity to truth: given a set of conflicting pieces of information about an object from different websites or information providers, find the true facts among them. Here, a fact is a piece of information provided by a website about an object and claimed by that website to be true; in reality, it can be either true or false.


CHAPTER B4

COMPARATIVE ANALYSIS

There have been many studies on ranking web pages based on Authority-Hub analysis, PageRank, and other link-based analyses. Unfortunately, this does not guarantee the accuracy of information. Sometimes users are not aware of the quality of the information they are getting; they may take wrong information to be correct and end up with wrong knowledge. To avoid this, the authors introduced the problem called veracity, defined as conformity to truth. It provides a way to find true information within a mass of conflicting information about an object. By designing a framework for this problem and using the TRUTHFINDER algorithm, which examines the relationship between websites and their content, users get better results.

For better results, we have to keep in mind that more accurate answers can be retrieved by consulting different websites instead of relying on a single one. Even a less popular website may provide accurate information where popular websites, e.g., Google's top-ranked websites, fail to do so.

TRUTHFINDER uses an iterative method to compute website trustworthiness and fact confidence. Iterative computation has proven successful in many applications. The iteration starts from some initial values, and at each iteration the values change from the previous state to the current state: first the website trustworthiness is used to compute the fact confidence, and then vice versa.

The system is implemented with the web browser as the application client. The aim is to implement TRUTHFINDER so that it finds true facts among conflicting information and identifies trustworthy websites better than other popular search engines, so users have no need to run another search.

B4.1 MEASURING TRUSTWORTHINESS

In this section we study the general logic of comparing the content of different documents to find the most truthful result.

[Figure B4.1: four documents and their contents; A = {1, 2, 3}, B = {2, 3}, C = {1, 3}, D = {1, 2, 3, 4}]

Figure B4.1 General Logic of Truth Analysis


In Figure B4.1 above, we have four documents: A, B, C, and D. Suppose we want to evaluate document A: we compare the content of A with that of B, C, and D to find the document with the highest truth weight. The general logic of comparing the content of A with documents B, C, and D is shown in Figure B4.2 below.

[Figure B4.2: comparing A = {1, 2, 3} with B, C, and D; weight 2 against B, weight 2 against C, weight 3 against D; total weight for document A = 7]

Figure B4.2 Comparing Document A's Content with the Remaining Documents

As shown in Figure B4.2, when comparing document A with B, 2 and 3 are present in both documents, so the weight is 2. Comparing A with C, 1 and 3 are present in both, so the weight is 2. Comparing A with D, 1, 2, and 3 are present in both, so the weight is 3. Adding all the weights gives a total weight of 7 for document A.

[Figure B4.3: comparing B = {2, 3} with A, C, and D; weight 2 against A, weight 1 against C, weight 2 against D; total weight for document B = 5]

Figure B4.3 Comparing Document B's Content with the Remaining Documents

As shown in Figure B4.3, when comparing document B with A, 2 and 3 are present in both documents, so the weight is 2. Comparing B with C, only 3 is present in both, so the weight is 1. Comparing B with D, 2 and 3 are present in both, so the weight is 2. Adding all the weights gives a total weight of 5 for document B.

[Figure B4.4: comparing C = {1, 3} with A, B, and D; weight 2 against A, weight 1 against B, weight 2 against D; total weight for document C = 5]

Figure B4.4 Comparing Document C's Content with the Remaining Documents

As shown in Figure B4.4, when comparing document C with A, 1 and 3 are present in both documents, so the weight is 2. Comparing C with B, only 3 is present in both, so the weight is 1. Comparing C with D, 1 and 3 are present in both, so the weight is 2. Adding all the weights gives a total weight of 5 for document C.

[Figure B4.5: comparing D = {1, 2, 3, 4} with A, B, and C; weight 3 against A, weight 2 against B, weight 2 against C; total weight for document D = 7]

Figure B4.5 Comparing Document D's Content with the Remaining Documents

As shown in Figure B4.5, when comparing document D with A, 1, 2, and 3 are present in both documents, so the weight is 3. Comparing D with B, 2 and 3 are present in both, so the weight is 2. Comparing D with C, 1 and 3 are present in both, so the weight is 2. Adding all the weights gives a total weight of 7 for document D.

The document with the highest weight is considered the most trustworthy (in content), followed by the other documents in descending order of weight. Documents with equal weights are ranked on a first-come, first-served basis. Based on these results, the trustworthiness ranking is A, D, B, C. A small sketch of this computation follows.
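A minimal sketch of this weighting in Python, using the document contents reconstructed in Figure B4.1 (the set representation is an assumption made for illustration):

```python
docs = {
    "A": {1, 2, 3},
    "B": {2, 3},
    "C": {1, 3},
    "D": {1, 2, 3, 4},
}

# Total weight of a document = sum of its content overlaps with every other document.
weights = {
    name: sum(len(items & other) for o, other in docs.items() if o != name)
    for name, items in docs.items()
}
# weights == {'A': 7, 'B': 5, 'C': 5, 'D': 7}

# Rank by weight; Python's stable sort breaks ties in first-come order.
ranking = sorted(docs, key=lambda name: -weights[name])
# ranking == ['A', 'D', 'B', 'C']
```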

B4.2 HOW ARE THESE TWO (PAPER 1 AND PAPER 2) DIFFERENT

In this section I discuss how Paper 1 differs from Paper 2. In Paper 1, the authors introduced the iterative computation method: at each iteration, the probabilities of facts being true and the trustworthiness of websites are recomputed from each other. As explained above, this is quite different from Authority-Hub analysis. The differences are: 1) the trustworthiness of a website is calculated not from the number of facts it provides but from the confidence of those facts; 2) different facts influence each other. For example, if one website says the author of a book is "G. Witt" and another says "Graeme Witt", the authors consider these two facts to influence each other even though they give slightly different values.

Take the example of TRUTHFINDER with partially supporting facts [1]. Suppose the standard set of authors for a book is {Graeme C. Smith, Graham Witt}, and the author set found by TRUTHFINDER is {Graeme Smith, G. Witt}. The TRUTHFINDER set is then partially equal to the standard set. How is this calculated? Different weights are assigned to the different parts of an author's name, with a total weight of 1 per author name. The ratio assigned to last name, first name, and middle name is 3:2:1; i.e., the last name has weight 3, the first name weight 2, and the middle name weight 1.


For example, in the TRUTHFINDER set above, "Graeme Smith" gets a score of 5/6 with respect to "Graeme C. Smith", because it omits the middle name. If the standard set has the full first or middle name and TRUTHFINDER provides only the correct initial, that part is assigned half its score. For example, "G. Smith" gets a score of 4/5 with respect to "Graeme Smith", because the first name has a weight of 2 and the correct initial G earns half of it. A sketch of this scheme follows.
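A minimal sketch of this partial-match scoring, assuming the 3:2:1 weights, half credit for a correct initial, and normalization by the weight of the name parts present in the standard name (the helper functions are illustrative, not the authors' code):

```python
WEIGHTS = {"last": 3, "first": 2, "middle": 1}

def part_score(standard, found, weight):
    """Score one name part: full weight for an exact match,
    half weight for a correct initial, zero otherwise."""
    if not standard:
        return 0.0, 0.0                    # part absent from the standard name
    if found == standard:
        return weight, weight
    if found and found.rstrip(".") == standard[0]:
        return weight / 2.0, weight        # correct initial earns half credit
    return 0.0, weight

def name_similarity(standard, found):
    """standard/found: dicts with 'first', 'middle', 'last' keys."""
    got, total = 0.0, 0.0
    for part, w in WEIGHTS.items():
        g, t = part_score(standard.get(part), found.get(part), w)
        got, total = got + g, total + t
    return got / total

# "Graeme Smith" vs. standard "Graeme C. Smith": omits the middle name -> 5/6
print(name_similarity({"first": "Graeme", "middle": "C", "last": "Smith"},
                      {"first": "Graeme", "last": "Smith"}))
# "G. Smith" vs. standard "Graeme Smith": initial gets half of weight 2 -> 4/5
print(name_similarity({"first": "Graeme", "last": "Smith"},
                      {"first": "G.", "last": "Smith"}))
```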

The authors formulated a set of rules that enable TRUTHFINDER to produce results efficiently, and they used several sets of examples to show that their system works more effectively than most existing search engines.

Paper 2, by contrast, is essentially an abstract of Paper 1: it reuses the veracity problem and the TRUTHFINDER algorithm from Paper 1, but analyzes trustworthiness with a different set of examples and uses the WebSIFT method for filtering data and mining websites.

As explained earlier, WebSIFT uses the content and structure of a website to identify interesting result sets (a detailed description is given in Chapter B3 of Appendix B).

Their data set contains the authors of many books, collected from different online book websites. Their search yielded 1,265 computer science books published by Addison Wesley, McGraw Hill, and Morgan Kaufmann [2]. For each book, the ISBN was used to search www.abebooks.com, which returns the book information and information about the stores selling that book. The data set contains 894 bookstores and 34,031 listings of where the books can be found. Their study shows that, on average, each book has 5.4 different sets of authors. TRUTHFINDER is applied to find the set of authors of each book, using the iterative computation described above. To test its accuracy, they randomly selected 100 books and manually determined their authors, taking the names appearing on the book cover (and its image) as the standard fact. They then compared TRUTHFINDER's result set of authors against the standard set to compute accuracy: for a given book, if the standard fact lists x authors and TRUTHFINDER finds y authors, of which z appear among the x standard authors, the accuracy is defined as z / max(x, y). (For instance, if x = 3, y = 2, and z = 2, the accuracy is 2/3.) This example is also found in "Truth Discovery with Multiple Conflicting Information Providers on the Web" [1].

B4.3 ADVANTAGES AND LIMITATIONS OF PAPER 1

B4.3.1 ADVANTAGES

Their experiments indicate that TRUTHFINDER achieves high accuracy in discovering true facts, in comparison with a normal search engine.

It can select more trustworthy websites than authority-based search engines such as Google.

B4.3.2 LIMITATIONS

The time taken to retrieve results for a given query is normally high compared with a normal search.

New users of the application need to familiarize themselves with the application environment, as no help files are included.

B4.4 ADVANTAGES AND LIMITATIONS OF PAPER 2

B4.4.1 ADVANTAGES

By using the WebSIFT technique, one can achieve better data results with TRUTHFINDER.

The user gets the intended result for a query in one shot, so there is no need to run another search for the same query.

B4.4.2 LIMITATIONS

The time taken to retrieve results for a query is high compared with a normal search engine.

If the system is implemented in real time, there will be more complexities to solve, and the technique is therefore costlier to adopt.


CHAPTER B5

CONCLUSION AND FUTURE WORK

B5.1 CONCLUSION

Nowadays, the quality of information on the web is a major concern for users [3], because there is no guarantee that the information available on the web is correct. To overcome this, the authors [1][2] introduced the problem called veracity, which aims at resolving conflicting facts from different websites and finding the trustworthy facts among them. For this, they proposed a new algorithm called TRUTHFINDER. It works on the basis of the interdependency between website trustworthiness and fact confidence to find trustworthy websites and true facts within conflicting information. Their experiments show that TRUTHFINDER achieves better accuracy in finding true facts and, at the same time, achieves good results in finding the trustworthy websites that provide the most accurate information on the web.

B5.2 FUTURE WORK

In the future, PageRank can be used with TRUTHFINDER to achieve still better results. In real time, however, the time taken by such a system to retrieve results for a given query may be comparatively high compared with a normal search, because both methods have their own procedures for retrieving results for a particular query. To overcome this problem, we can adopt PageRank with TRUTHFINDER under certain conditions. The general idea is as follows: Google, for example, returns thousands of links for a given query; from this result set, take only the first batch, say the top 50 results, and apply TRUTHFINDER to it, then continue in the same way with the rest of the result set. A system built this way can achieve better and faster results. It is, however, necessary to define a set of rules for following this method, so I describe one heuristic that might be useful: apply TRUTHFINDER to the top results of PageRank first, and continue applying it to the rest of the result set. Here we assume the top PageRank results are the most likely to be true in the PageRank sense, and we know that TRUTHFINDER achieves better accuracy; so by combining TRUTHFINDER with PageRank in this way, we can retrieve better, faster results instead of applying TRUTHFINDER to the whole PageRank result set.


REFERENCES

[1] Xiaoxin Yin (UIUC), Jiawei Han (UIUC), and Philip S. Yu (IBM T.J. Watson Research Center), "Truth Discovery with Multiple Conflicting Information Providers on the Web," Feb. 7, 2007.

[2] Rathod Maheshwar and Ch. Kishore Kumar (Intrinity Engineering College, India), "Analysis of Conflicting Information and a Study on Truth Finder Algorithm," vol. 2, no. 9, December 2011.

[3] Princeton Survey Research Associates International, "Leap of Faith: Using the Internet Despite the Dangers," results of a national survey of Internet users for Consumer Reports WebWatch, Oct. 2005.

[4] Robert Cooley, Pang-Ning Tan, and Jaideep Srivastava (Department of Computer Science, University of Minnesota), "WebSIFT: The Website Information Filter System," June 13, 1999.

[5] Google Images, used for reference images when drawing the figures in this document, https://www.google.co.in/imghp?hl=en&tab=ii.

[6] B. Amento, L.G. Terveen, and W.C. Hill, "Does 'Authority' Mean Quality? Predicting Expert Quality Ratings of Web Documents," Proc. ACM SIGIR '00, July 2000.

[7] M. Blaze, J. Feigenbaum, and J. Lacy, Decentralized Trust Management, Proc. IEEE Symp. Security and Privacy (ISSP 96), May 1996.

[8] A. Borodin, G.O. Roberts, J.S. Rosenthal, and P. Tsaparas, "Link Analysis Ranking: Algorithms, Theory, and Experiments," ACM Trans. Internet Technology, vol. 5, no. 1, pp. 231-297, 2005.

[9] J.S. Breese, D. Heckerman, and C. Kadie, Empirical Analysis of Predictive Algorithms for Collaborative Filtering, technical report, Microsoft Research, 1998.

[10] R. Guha, R. Kumar, P. Raghavan, and A. Tomkins, "Propagation of Trust and Distrust," Proc. 13th Int'l Conf. World Wide Web (WWW), 2004.

[11] G. Jeh and J. Widom, SimRank: A Measure of Structural-Context Similarity, Proc. ACM SIGKDD 02, July 2002.

[12] J.M. Kleinberg, "Authoritative Sources in a Hyperlinked Environment," J. ACM, vol. 46, no. 5, pp. 604-632, 1999.

[13] "Logistic Equation," Wolfram MathWorld, http://mathworld.wolfram.com/LogisticEquation.html, 2008.

[14] T. Mandl, Implementation and Evaluation of a Quality-Based Search Engine, Proc. 17th ACM Conf. Hypertext and Hypermedia, Aug. 2006.

[15] L. Page, S. Brin, R. Motwani, and T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web, technical report, Stanford Digital Library Technologies Project, 1998.

[16] "Sigmoid Function," Wolfram MathWorld, http://mathworld.wolfram.com/SigmoidFunction.html, 2008.

[17] R.Y. Wang and D.M. Strong, Beyond Accuracy: What Data Quality Means to Data Consumers, J. Management Information Systems, vol. 12, no. 4, pp. 5-34, 1997.

[18] Data quality and its importance http://en.wikipedia.org/wiki/Data_quality

[19] Distributed Multitiered Applications of J2EE, http://java.sun.com/j2ee/tutorial/1_3-fcs/doc/Overview2.html
