
Efficient Techniques for Online Record Linkage

CONTENTS

Chapter 01: Abstract, Project Purpose, Project Scope, Product Features, Introduction

Chapter 02: System Analysis - Problem Definition, Existing System, Limitations of Existing System, Proposed System, Advantages of Proposed System, Process Flow Diagrams for Existing and Proposed System, Feasibility Study, Hardware and Software Requirements, Functional Requirements, Non-Functional Requirements, Literature Survey

Chapter 03: System Design - SDLC (Spiral Model), Project Architecture, Module Description, UML Diagrams (Use Case, Class, Sequence, Activity), Data Dictionary, ER Diagram

Chapter 04: Process Specification (Techniques and Algorithms Used), Screen Shots

Chapter 05: Technology Description, Full Project Coding, Database with Video Tutorial, How to Install Document

Chapter 06: Testing - Black & White Box Testing, Unit Testing, System Testing, Integration Testing, Test Case Table

Chapter 07: Conclusion, Limitations & Future Enhancements, Reference & Bibliography

Abstract:
Matching records that refer to the same entity across databases is becoming an increasingly important part of many data mining projects, as data from multiple sources often needs to be matched in order to enrich data or improve its quality. Record linkage is the computation of the associations among records of multiple databases. It arises in contexts like the integration of such databases, online interactions and negotiations, and many others. Matching data from heterogeneous data sources is a real problem: an organization must resolve a number of types of heterogeneity problems, especially the non-uniformity problem. Statistical record linkage techniques could be used for resolving this problem, but they cause a communication bottleneck in a distributed environment. A matching tree is used to overcome this communication overhead and gives the same matching decisions as those obtained using the conventional linkage technique.

Project Purpose: If the databases use the same set of design standards, this linking can easily be done using the primary key (or other common candidate keys). However, since these heterogeneous databases are usually designed and managed by different organizations (or different units within the same organization), there may be no common candidate key for linking the records. Although it may be possible to use common non-key attributes (such as name, address, and date of birth) for this purpose, the result obtained using these attributes may not always be accurate. This is because non-key attribute values may not match even when the records represent the same entity instance in reality.

Project Scope: The databases exhibiting entity heterogeneity are distributed, and it is not possible to create and maintain a central data repository or warehouse where precomputed linkage results can be stored. A centralized solution may be impractical for several reasons. First, if the databases span several organizations, the ownership and cost allocation issues associated with the warehouse could be quite difficult to address. Second, even if the warehouse could be developed, it would be difficult to keep it up-to-date. As updates occur at the operational databases, the linkage results would become stale if they are not updated immediately.

Product Feature: An important issue associated with record linkage in distributed environments is that of schema integration. For record linkage techniques to work well, one should be able to identify the common non-key attributes between two databases. If the databases are designed and maintained independently, as is the case in most heterogeneous environments, it would be necessary to develop an integrated schema before the common attributes can be identified.

Introduction:
1.1 GENERAL

The record-linkage problem, identifying and linking duplicate records, arises in the context of data cleansing, which is a necessary pre-step to many database applications. Databases frequently contain approximately duplicate fields and records that refer to the same real-world entity but are not identical. Given the importance of data linkage in a variety of data-analysis applications, developing effective and efficient techniques for record linkage has emerged as an important problem. This is further evidenced by the emergence of numerous organizations (e.g., Trillium, First Logic, Vality, Data Flux) that are developing specialized, domain-specific record-linkage and data-cleansing tools. The data needed to support these decisions are often scattered in heterogeneous distributed databases. In such cases, it may be necessary to link records in multiple databases so that one can consolidate and use the data pertaining to the same real-world entity. If the databases use the same set of design standards, this linking can easily be done using the primary key (or other common candidate keys). However, since these heterogeneous databases are usually designed and managed by different organizations (or different units within the same organization), there may be no common candidate key for linking the records. Although it may be possible to use common non-key attributes (such as name, address, and date of birth) for this purpose, the result obtained using these attributes may not always be accurate. This is because non-key attribute values may not match even when the records represent the same entity instance in reality.
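To make the point concrete, the following small Java sketch (illustrative only, not part of the project code) shows why exact equality on non-key attributes is too strict and why an approximate similarity measure is needed; the normalization rule and the token-overlap measure are assumptions chosen for brevity, and real systems often use edit-distance or phonetic measures instead.

// Illustrative sketch: comparing non-key attributes such as name or address.
// Values for the same real-world entity may differ in case, punctuation, or
// abbreviation, so exact equality fails where an approximate measure does not.
public class AttributeComparison {

    // Normalize a value before comparison: lower-case, drop punctuation, trim.
    static String normalize(String value) {
        return value == null ? "" : value.toLowerCase().replaceAll("[^a-z0-9 ]", "").trim();
    }

    // Simple token-overlap similarity in [0, 1] (an assumed measure, for illustration).
    static double similarity(String a, String b) {
        java.util.Set<String> ta = new java.util.HashSet<>(java.util.Arrays.asList(normalize(a).split("\\s+")));
        java.util.Set<String> tb = new java.util.HashSet<>(java.util.Arrays.asList(normalize(b).split("\\s+")));
        java.util.Set<String> common = new java.util.HashSet<>(ta);
        common.retainAll(tb);
        java.util.Set<String> union = new java.util.HashSet<>(ta);
        union.addAll(tb);
        return union.isEmpty() ? 1.0 : (double) common.size() / union.size();
    }

    public static void main(String[] args) {
        // Same address, different spellings: exact match fails, similarity is still high.
        System.out.println("10 Park Rd.".equals("10 Park Road"));      // false
        System.out.println(similarity("10 Park Rd.", "10 Park Road")); // 0.5
    }
}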

The databases exhibiting entity heterogeneity are distributed, and it is not possible to create and maintain a central data repository or warehouse where precomputed linkage results can be stored. A centralized solution may be impractical for several reasons. First, if the databases span several organizations, the ownership and cost allocation issues associated with the warehouse could be quite difficult to address. Second, even if the warehouse could be developed, it would be difficult to keep it up-to-date. As updates occur at the operational databases, the linkage results would become stale if they are not updated immediately. This staleness may be unacceptable in many situations. For instance, in a criminal investigation, one may be interested in the profile of crimes committed in the last 24 hours within a certain radius of the crime scene. In order to keep the warehouse current, the sites must agree to transmit incremental changes to the data warehouse on a real-time basis. Even if such an agreement is reached, it would be difficult to monitor and enforce it. For example, a site would often have no incentive to report the insertion of a new record immediately. Therefore, these changes are likely to be reported to the warehouse at a later time, thereby increasing the staleness of the linkage tables and limiting their usefulness. In addition, the overall data management tasks could be prohibitively time-consuming, especially in situations where there are many databases, each with many records, undergoing real-time changes. This is because the warehouse must maintain a linkage table for each pair of sites, and must update them every time one of the associated databases changes. The participating sites allow controlled sharing of portions of their databases using standard database queries, but they do not allow the processing of scripts, stored procedures, or other application programs from another organization. The issue here is clearly not one of current technological abilities, but that of management and control. If the management of an organization wants to open its databases to outside scripts from other organizations, there are, of course, a variety of ways to actually implement it. However, the decision to allow only a limited set of database queries (and nothing more) is not based on technological limitations; rather, it is often a management decision arising out of security concerns. More investment in technology or a more sophisticated scripting technique, therefore, is not likely to change this situation. A direct consequence of this fact is that the local site cannot simply send the lone enquiry record to the remote site and ask the remote site to perform the record linkage and send the results back.

Chapter 2 System Analysis

Problem Definition: We draw upon research in the area of sequential information acquisition to provide an efficient solution to the online, distributed record linkage problem. The main benefit of the sequential approach is that, unlike the traditional full-information case, not all the attributes of all the remote records are brought to the local site; instead, attributes are brought one at a time. After acquiring an attribute, the matching probability is revised based on the realization of that attribute, and a decision is made whether or not to acquire more attributes. By recursively acquiring attributes and stopping only when the matching probability cannot be revised sufficiently, the sequential approach identifies, as possible matches, the same set of records as the traditional full-information case.
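A minimal sketch of this sequential acquisition loop is given below. It is illustrative only: the prior probability, the likelihood ratios used to revise the matching probability, and the stopping thresholds are assumptions made for the example, not the actual model used in the project.

import java.util.List;
import java.util.Map;

// Sketch of sequential attribute acquisition: attributes of a remote record are
// fetched one at a time, the matching probability is revised after each fetch,
// and acquisition stops as soon as an upper or lower threshold is crossed.
public class SequentialLinkage {

    interface RemoteRecord {
        String fetchAttribute(String name); // each call costs one communication round trip
    }

    enum Decision { MATCH, NON_MATCH, UNDECIDED }

    static Decision decide(RemoteRecord remote, Map<String, String> local,
                           List<String> attributeOrder, double upper, double lower) {
        double p = 0.5; // assumed prior matching probability
        for (String attr : attributeOrder) {
            String remoteValue = remote.fetchAttribute(attr); // acquire one attribute only
            boolean agrees = remoteValue != null && remoteValue.equalsIgnoreCase(local.get(attr));
            double likelihoodRatio = agrees ? 4.0 : 0.25;     // assumed agreement/disagreement odds
            double odds = (p / (1 - p)) * likelihoodRatio;    // Bayesian-style revision
            p = odds / (1 + odds);
            if (p >= upper) return Decision.MATCH;     // stop early: confident match
            if (p <= lower) return Decision.NON_MATCH; // stop early: confident non-match
        }
        return Decision.UNDECIDED; // all attributes seen, probability stayed between thresholds
    }
}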

Existing System:

Heterogeneous databases are usually designed and managed by different organizations, so there may not be any common candidate key for linking the records. It is possible to use common non key attributes to access the heterogeneous database, but the result obtained using these attributes may not always be accurate.

When the matching records reside at a remote site, existing techniques cannot be directly applied because they would involve transferring the entire remote relation, thereby incurring a huge communication overhead. Record linkage techniques do not have an efficient implementation in an online, distributed environment and have mostly been confined to either local master files or to matching data from various sources in a batch processing mode.

Limitations: It does not work online. It is not cost effective. It cannot reduce communication overhead.

Proposed System:
An efficient technique is developed to facilitate record linkage decisions in a distributed, online setting. A matching tree is developed for attribute acquisition based on sequential decision making. The proposed techniques reduce the communication overhead considerably, and the linkage performance is assured to be at the same level as the traditional approach.

Advantages: It reduces the communication overhead in a distributed environment. It is a cost-effective model.

Feasibility Study: Objectives: In the project development and feasibility studies stage of the SDLC, software engineers and developers should be able to:

Identify the Business Value: Developing software is not just creating good software and presenting it to the market with hopes that someone will actually use or buy it. Before the software is even created, the idea will be evaluated again and again. This objective has to be fulfilled by the company's researchers. They will be presenting the current market situation, the current users' needs, or a gap in the World Wide Web that software engineers can fill.

Estimate Investment and Reward on the Project: In project planning, the investment in a certain project has to be revealed. This will be the backbone of every project. For one, investment will dictate how much the company will spend to create certain software. This is even truer for companies that usually hire project-based developers. Investment will dictate how many people will be working on the project. Aside from investment, project planning and feasibility studies should show how much the company will earn once the project is created. If it is just a tool for businesses, it should show how it can increase the productivity of the employees and its actual impact in a financial sense.

Analyze Feasibility: Researchers or proponents of the software will actually show why the software is needed in the actual business sense. Statistical data will always play a crucial role in proving why the particular software is good for public use. Most of the time, researchers will be interviewing individuals in order to get their opinion on whether they would use the software if it were available in the market. It will also take a look at the possible competition and how the software will be different compared to other companies.

Outline Technical Needs

Procedures: Within this stage, these are the procedures usually followed by software development companies. At the end of this stage, developers should be able to set the stage for the next phase of project development. 1. Definition of business problem and scope What are the functions of the projected software? Who will use it? How will the company earn from it? The reason why the software should be created is properly defined. Developers have to understand these needs in order to create a favorable plan as a response to these problems.

2. Detailed Project Schedule is Outlined Once the business problem has been outlined and proper responses have been laid through the projected functions of software, project managers should be able to outline the timeframe when the software will be available for public or intended users. 3. Project Approval By this time, the projected software is not yet presented to the upper management. Small software developers may have to bypass this point since

they will be the same persons who will be approving the software. On the other hand, developers under larger companies will have to present their ideas to the upper management. This is where you will realize the importance of management support for SDLC. Their final decision on the project will determine whether the software building will push through or not. 4. Economic, Organizational, Technical, Resource, and Schedule: In this stage, everything will be placed on paper: the needed financial resources, the needed back-up software and platforms, hardware devices and even the projected number of personnel who will be responsible for building the software. 5. Resource Management: After the paper, everything written on it should be implemented. First and foremost, additional hiring of developers should be in place. Hardware and software components should be installed and tested. Support staff should already be in place once production has started. 6. Project Launching: When everything is in place, it is time to get the ball rolling. Depending on the company size, project launching should be welcomed. There are companies that issue an official press release online and possibly in different media outlets. This will inform their potential clients and even competitors of the technological advantage of the company in creating this type of software.

Hardware Requirements:
Processor : Pentium IV 2.6 GHz
RAM : 256 MB and above
Hard Disk : 10 GB
Monitor : VGA and High Resolution Monitor

Software Requirements:
Front End : Java
Operating System : Windows
Back End : SQL Server 2005

FUNCTIONAL REQUIREMENT: 1 INTRODUCTION

Provide an overview of the system and some additional information to place the system in context. A software requirements specification (SRS) is a comprehensive description of the intended purpose and environment for software under development. The SRS fully describes what the software will do and how it will be expected to perform. An SRS minimizes the time and effort required by developers to achieve desired goals and also minimizes the development cost. A good SRS defines how an application will interact with system hardware, other programs and human users in a wide variety of real-world situations. Parameters such as operating speed,

response time, availability, portability, maintainability, footprint, security and speed of recovery from adverse events are evaluated. 2 PURPOSE

Provide an overall description of the FRD, its purpose. Reference the system name and identifying information about the system to be implemented. 3 SCOPE

Discuss the scope of the document and how it accomplishes its purpose of the project.

4 BACKGROUND

Describe the organization and its overall responsibilities. Describe who is producing the document and why.

5 REFERENCES

List references and controlling documents, including: meeting summaries, white papers, other deliverables, etc. 6 ASSUMPTIONS AND CONSTRAINTS

Provide a list of contractual or task level assumptions and/or constraints that are preconditions to preparation of the FRD. Assumptions are future situations beyond the control of the project, whose outcomes influence the success of a project. 7 ASSUMPTIONS

Examples of assumptions include: availability of a technical platform, legal changes and policy decisions. 8 CONSTRAINTS

Constraints are boundary conditions on how the system must be designed and constructed. Examples include: legal requirements, technical standards, strategic decisions. Constraints exist because of real business conditions. For example, a delivery date is a constraint only if there are real business consequences that will happen as a result of not meeting the date. If failing to have the subject application operational by the specified date places the organization in legal default, the date is a constraint. Preferences are arbitrary. For example, a date chosen arbitrarily is a preference. Preferences, if included in the FRD, should be noted as such. 9 DOCUMENT OVERVIEW

Provide a description of the document organization.

9.1

USER REQUIREMENTS

Provide requirements of the system, user or business, taking into account all major classes/categories of users. Provide the type of security or other distinguishing characteristics of each set of users. List the functional requirements that compose each user requirement. As the functional requirements are decomposed, the highest level functional requirements are traced to the user requirements. Inclusion of lower level functional requirements is not mandatory in the traceability to user requirements if the parent requirements are already traced to them. User requirement information can be in text or process flow format for each major user class that shows what inputs will initiate the system functions, system interactions, and what outputs are expected to be generated by the system. The scenarios should be comprehensive, to the extent that all user types

and all major functions are covered. Give each user requirement a unique number. Typically, user requirements have a numbering system that is separate from the functional requirements. Requirements may be labeled with a leading U or other label indicating user requirements. 9.2 DATA FLOW DIAGRAMS

Decompose the context level diagrams to determine the functional requirements. Data flow diagrams should be decomposed down to the functional primitive level. These diagrams are further decomposed during design.

9.3 LOGICAL DATA MODEL/DATA DICTIONARY

Create the initial Logical Data Model. Describe data requirements by providing data entities, decomposition, and definitions in a data dictionary. The data requirements describe the business data needed by the application system. Data requirements do not describe the physical database and are not at the level of identifying field names. NON FUNCTIONAL REQUIREMENT: Non-Functional Requirements are not really requirements at all. Rather, they are constraints on implementing the functional requirements as defined in the use case documents and other documents or models. However, for the purposes of requirements management they are considered to be requirements and, as such, need to be tested. The first rule of defining non-functional requirements, therefore, is to ensure that they are testable. A requirement that cannot be tested may as well not be included as a requirement. The best way to ensure that nonfunctional requirements can be tested is to think about how they might be tested when they are written. If it is not obvious then ask a test developer how they would test it.

Other guidelines should include whether they are written in a way that is unambiguous and easy to understand. Some of the requirements may be very technical in nature so not all can be expected to be understood by non-technical people. However, those who will have to implement them and those who will have to test them should be able to understand them easily and completely. Non-functional requirements may be present in other documents and models where they have been added as notes or as high-level requirements. These include the stakeholder needs list, the project overview, the business process model, and the use case documents. 1 PROJECT OVERVIEW

Chapter 1 contains the project overview and a summary of known major technical constraints. Add these to the non-functional requirements document and update the project overview with references to them. 2 STAKEHOLDER NEEDS LIST

These are single-line statements of needs gathered early in the project from all stakeholders. Scan through the list and look for those needs that would not have been included in the use case documents. Add them to the appropriate section of the non-functional requirements document as single-sentence statements. Update the stakeholder needs list with the reference to the requirement in the non-functional requirements document. 3 BUSINESS PROCESS MODEL

When the business process model is created, notes may have been added to the processes as attachments to activities on activity diagrams regarding stakeholder requirements as they have been mentioned by the stakeholder representative or process owner. Search all relevant activities for notes relating to non-functional requirements and add them to the non-functional requirements document.

Update the activity notes with the reference to the requirement in the nonfunctional requirements document. 4 USE CASE DOCUMENTS

Use Case Documents have a specific section for non-functional requirements. These should normally apply specifically to the use case in question. It may be, however, that requirements that apply across more than one use case have been included. If so, add the requirements to the non-functional requirements document and reference them in the use case document. 5 USABILITY Resist making statements such as "The system shall be user friendly"; it is entirely insufficient. The non-functional requirements document template breaks the section down into seven subsections and prompts for completion of each section. Be specific about the mechanism by which each aspect will be met. For example, in the section Learnability say whether the user is expected to learn how to use the system by looking up the help or using a tutorial, or whether the system can be put into a learner mode whereby it guides a learner user through a use case by offering detailed prompts at every step. If there is to be no documentation at all and the system is to be learnt from the developers demonstrating it to the users, then say so. 6 RELIABILITY Reliability is notoriously difficult to test before the system goes live as it relies on continuous use and metrics. Some guidance here on what is realistic to expect should be sought in consultation with stakeholders. Figures should be defined on the basis of the metrics available for existing systems and the expectations for improvement or relaxation of these values. It will only be known if the system meets these requirements if proper metrics and

requirements gathering are put in place. Notes to this effect should be included. If no serious attempt is to be made to do this, then write "None" in each section. 7 PERFORMANCE Performance can and should be tested early in the development as part of the architectural development phase. Care needs to be taken to specify throughput and response times in terms of the creation and processing of major data entities with reference to specific use cases in the use case model. Ensure that the requirements are written in such a way that the testing of them will be straightforward. If the system is to share infrastructure resources with other systems, specify what proportion of those resources must be available if the response times are to be met. Also include any dependencies on outside systems and specify how these are expected to respond if resource and throughput targets are to be met.

8 SECURITY

Security includes all steps that are to be taken to secure the system

against both voluntary and involuntary corruption. This includes management of usernames and passwords; encryption of data transfers both internally and across external systems such as the internet; firewalls and protection against viruses, Trojans, worms and all kinds of malicious code attacks including denial of service. It may be useful to refer to the standards specified in section 8.3 under implementation constraints.

9 SUPPORTABILITY

Specify ease of installation, configuration and testing in terms both of the time to achieve the goal and specific means for achieving it. Consider, for example, installation software and scripts, use cases for configuration and automatic self-testing. Think carefully about how each of these requirements will actually be tested.

10 INFRASTRUCTURE REQUIREMENTS These should include a description of the existing or new hardware, software and networks on which the system is expected to run. Create a deployment diagram, or equivalent, which shows all processing nodes, peripherals and communication links. Specify the required capacity for each of these, or, where the system runs on existing infrastructure, the amount of capacity that needs to be available for the system to meet its performance requirements. Also specify any external systems or services upon which this system will depend, in terms of the performance that is required of these systems if this system is to meet its performance targets.

11 IMPLEMENTATION CONSTRAINTS A constraint is a requirement which leaves no design option. Implementation constraints, rather than describing what the system will do, describe constraints on the design by which what it is to do will be achieved. If there is no constraint in a section (e.g., the developers could use any language they like), then say so. Otherwise describe just the constraint. When referring to system interfaces, legacy systems and databases, refer to the design documentation for these. Add important diagrams to Appendix A and refer to them in the text. If there is insufficient information about these external systems then mention that this information will need to be completed for the purposes of the development of this system.

Literature Survey:

1. Linked Record Health Data Systems: The goal of record linkage is to link quickly and accurately records that correspond to the same person or entity. Whereas certain patterns of agreements and disagreements on variables are more likely among records pertaining to a single person than among records for different people, the observed patterns for pairs of records can be viewed as arising from a mixture of matches and non-matches. Mixture model estimates can be used to partition record pairs into two or more groups that can be labeled as probable matches (links) and probable non-matches. A method is proposed and illustrated that uses marginal information in the database to select mixture models, identifies sets of records for clerks to review based on the models and marginal information, incorporates clerically reviewed data, as they become available, into estimates of model parameters, and classifies pairs as links, non-links, or in need of further clerical review. The procedure is illustrated with five datasets from the U.S. Bureau of the Census. It appears to be robust to variations in record-linkage sites. The clerical review corrects classifications of some pairs directly and leads to changes in classification of others through re-estimation of mixture models.

2. Efficient Private Record Linkage: Record linkage is the computation of the associations among records of multiple databases. It arises in contexts like the integration of such databases, online interactions and negotiations, and many others. The autonomous entities who wish to carry out the record matching computation are often reluctant to fully share their data. In such a framework where the entities are unwilling to share data with each other, the problem of carrying out the linkage computation without full data exchange has been called private record linkage. Previous private record linkage techniques have made use of a third party. We provide efficient techniques for private record linkage that improve on previous work in that (i) they make no use of a third party; (ii) they achieve much better performance than that of previous schemes in terms of execution time and quality of output (i.e., practically without false negatives and minimal false positives). Our software implementation provides experimental validation of our approach and the above claims.

3. A Survey of Approaches to Automatic Schema Matching: Schema matching is a basic problem in many database application domains, such as data integration, E-business, data warehousing, and semantic query processing. In current implementations, schema matching is typically performed manually, which has significant limitations. On the other hand, previous research papers have proposed many techniques to achieve a partial automation of the match operation for specific application domains. We present a taxonomy that covers many of these existing approaches, and we describe the approaches in some detail. In particular, we distinguish between schema-level and instance-level, element-level and structure-level, and language-based and constraint-based matchers. Based on our classification we review some previous match implementations, thereby indicating which part of the solution space they cover. We intend our taxonomy and review of past work to be useful when comparing different approaches to schema matching, when developing a new match algorithm, and when implementing a schema matching component.

4. A Comparison of Fast Blocking Methods for Record Linkage: The task of linking databases is an important step in an increasing number of data mining projects, because linked data can contain information that is not available otherwise, or that would require time-consuming and expensive collection of specific data. The aim of linking is to match and aggregate all records that refer to the same entity. One of the major challenges when linking large databases is the efficient and accurate classification of record pairs into matches and non-matches. While traditionally classification was based on manually set thresholds or on statistical procedures, many of the more recently developed classification methods are based on supervised learning techniques. They therefore require training data, which is often not available in real world situations or has to be prepared manually, an expensive, cumbersome and time-consuming process. The author has previously presented a novel two-step approach to automatic record pair classification. In the first step of this approach, training examples of high quality are automatically selected from the compared record pairs, and used in the second step to train a support vector machine (SVM) classifier. Initial experiments showed the feasibility of the approach, achieving results that outperformed k-means clustering. In this paper, two variations of this approach are presented. The first is based on a nearest neighbor classifier, while the second improves an SVM classifier by iteratively adding more examples into the training sets. Experimental results show that this two-step approach can achieve better classification results than other unsupervised approaches.

5. A Method for Calibrating False-Match Rates in Record Linkage: Specifying a record-linkage procedure requires both (1) a method for measuring closeness of agreement between records, typically a scalar weight, and (2) a rule for deciding when to classify records as matches or non-matches based on the weights. Here we outline a general strategy for the second problem, that is, for accurately estimating false-match rates for each possible cutoff weight. The strategy uses a model where the distribution of observed weights is viewed as a mixture of weights for true matches and weights for false matches. An EM algorithm for fitting mixtures of transformed-normal distributions is used to find posterior modes; associated posterior variability is due to uncertainty about specific normalizing transformations as well as uncertainty in the parameters of the mixture model, the latter being calculated using the SEM algorithm. This mixture-model calibration method is shown to perform well in an applied setting with census data. Further, a simulation experiment reveals that, across a wide variety of settings not satisfying the model's assumptions, the procedure is slightly conservative on average in the sense of overstating false-match rates, and the one-sided confidence coverage (i.e., the proportion of times that these interval estimates cover or overstate the actual false-match rate) is very close to the nominal rate.

6. Applying Model Management to Classical Meta Data Problems: We wish to measure the evidence that a pair of records relates to the same, rather than different, individuals. The paper emphasizes statistical models which can be fitted to a file of record pairs known to be correctly matched, and then used to estimate likelihood ratios. A number of models are developed and applied to UK immigration statistics. The combination of likelihood ratios for possibly correlated record fields is also examined.

7. Record Linkage: Current Practice and Future Directions: Record linkage is the task of quickly and accurately identifying records corresponding to the same entity from one or more data sources. Record linkage is also known as data cleaning, entity reconciliation or identification and the merge/purge problem. This paper presents the standard probabilistic record linkage model and the associated algorithm. Recent work in information retrieval, federated database systems and data mining has proposed alternatives to key components of the standard algorithm. The impact of these alternatives on the standard approach is assessed. The key question is whether and how these new alternatives are better in terms of time, accuracy and degree of automation for a particular record linkage application.

MODULES:
1. Multiple Source Data
2. Statistical Record Linkage Techniques
3. Detecting Overhead by Matching Tree
4. Overheads Eliminated

1. Multiple Source Data: Matching records across databases is becoming an increasingly important part of many data mining projects, as data from multiple sources often needs to be matched in order to enrich data or improve its quality. The data needed to support these decisions are often scattered in heterogeneous distributed databases. In such cases, it may be necessary to link records in multiple databases so that one can consolidate and use the data pertaining to the same real-world entity.

2. Statistical Record Linkage Techniques: Statistical record linkage techniques could be used for this purpose, but the use of such techniques for online record linkage could pose a tremendous communication bottleneck in a distributed environment (where entity heterogeneity problems are often encountered). If the databases use the same set of design standards, this linking can easily be done using the primary key (or other common candidate keys); otherwise, common non-key attributes must be used, and the result obtained using these attributes may not always be accurate. This is because non-key attribute values may not match even when the records represent the same entity instance in reality.

3. Detecting Overhead by Matching Tree:

When the matching records reside at a remote site, existing techniques cannot be directly applied because they would involve transferring the entire remote relation, thereby incurring a huge communication overhead. The matching tree reduces this overhead while providing matching decisions that are guaranteed to be the same as those obtained using the conventional linkage technique. These techniques have been implemented, and experiments with real-world and synthetic databases show a significant reduction in communication overhead.

4. Overheads Eliminated: We develop a matching tree, similar to a decision tree, and use it to propose techniques that reduce the communication overhead significantly, while providing matching decisions that are guaranteed to be the same as those obtained using the conventional linkage technique. This makes the online record linkage process more efficient by reducing the communication overhead in a distributed environment. To demonstrate that these techniques provide significant savings in communication overhead, we normalize the communication overhead needed by our approach by the size of the remote database from which the matching records need to be extracted. A minimal sketch of such a tree is given below.
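The sketch below shows one way such a matching tree could be represented and traversed; the node structure, the agree/disagree branching, and the overhead counter are assumptions made for illustration, not the project's actual data structures. Dividing the number of acquired attributes by the number of attributes in the full remote record approximates the normalized communication overhead described above.

import java.util.Map;
import java.util.function.Function;

// Sketch of a matching tree, similar to a decision tree: each internal node
// names the next attribute to acquire, the agree/disagree outcome selects the
// child subtree, and a leaf carries the final match / non-match decision.
public class MatchingTree {

    static class Node {
        String attribute;        // attribute to acquire at this node (null at a leaf)
        Node onAgree, onDisagree;
        Boolean decision;        // non-null only at a leaf: true means match
    }

    // Walk the tree for one enquiry record, acquiring remote attributes lazily.
    static boolean matches(Node root, Map<String, String> localRecord,
                           Function<String, String> remoteFetch, int[] attributesAcquired) {
        Node node = root;
        while (node.decision == null) {
            String remoteValue = remoteFetch.apply(node.attribute); // one attribute transferred
            attributesAcquired[0]++;
            boolean agrees = remoteValue != null
                    && remoteValue.equalsIgnoreCase(localRecord.get(node.attribute));
            node = agrees ? node.onAgree : node.onDisagree;
        }
        return node.decision;
    }
}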


Chapter 03 System Design:

SDLC METHODOLOGIES

This document plays a vital role in the software development life cycle (SDLC) as it describes the complete requirements of the system. It is meant for use by developers and will be the basis during the testing phase. Any changes made to the requirements in the future will have to go through a formal change approval process. The SPIRAL MODEL was defined by Barry Boehm in his 1988 article "A Spiral Model of Software Development and Enhancement". This model was not the first to discuss iterative development, but it was the first model to explain why the iteration matters. As originally envisioned, the iterations were typically 6 months to 2 years long. Each phase starts with a design goal and ends with a client reviewing the progress thus far. Analysis and engineering efforts are applied at each phase of the project, with an eye toward the end goal of the project. The steps of the Spiral Model can be generalized as follows: The new system requirements are defined in as much detail as possible. This usually involves interviewing a number of users representing all the external or internal users and other aspects of the existing system. A preliminary design is created for the new system. A first prototype of the new system is constructed from the preliminary design. This is usually a scaled-down system, and represents an approximation of the characteristics of the final product. A second prototype is evolved by a fourfold procedure: 1. Evaluating the first prototype in terms of its strengths, weaknesses, and risks. 2. Defining the requirements of the second prototype. 3. Planning and designing the second prototype. 4. Constructing and testing the second prototype. At the customer's option, the entire project can be aborted if the risk is deemed too great. Risk factors might involve development cost overruns, operating-cost miscalculation, or any other factor that could, in the customer's judgment, result in a less-than-satisfactory final product.

The existing prototype is evaluated in the same manner as was the previous prototype, and if necessary, another prototype is developed from it according to the fourfold procedure outlined above. The preceding steps are iterated until the customer is satisfied that the refined prototype represents the final product desired. The final system is constructed, based on the refined prototype. The final system is thoroughly evaluated and tested. Routine maintenance is carried out on a continuing basis to prevent large-scale failures and to minimize downtime.

The following diagram shows how the spiral model works:

Fig -Spiral Model

ADVANTAGES

Estimates (i.e., budget, schedule, etc.) become more realistic as work progresses, because important issues are discovered earlier. It is better able to cope with the changes that software development generally entails. Software engineers can get their hands in and start working on the core of a project earlier.

APPLICATION DEVELOPMENT

N-TIER APPLICATIONS

N-Tier Applications can easily implement the concepts of Distributed Application Design and Architecture. The N-Tier Applications provide strategic benefits to Enterprise Solutions. While 2-tier, client-server applications can help us create quick and easy solutions and may be used for Rapid Prototyping, they can easily become a maintenance and security nightmare. The N-tier Applications provide specific advantages that are vital to the business continuity of the enterprise. Typical features of a real-life n-tier application may include the following:
Security
Availability and Scalability
Manageability
Easy Maintenance
Data Abstraction

The above mentioned points are some of the key design goals of a successful n-tier application that intends to provide a good Business Solution.

DEFINITION

Simply stated, an n-tier application helps us distribute the overall functionality into various tiers or layers:
Presentation Layer
Business Rules Layer
Data Access Layer
Database/Data Store

Each layer can be developed independently of the others provided that it adheres to the standards and communicates with the other layers as per the specifications. This is one of the biggest advantages of the n-tier application. Each layer can potentially treat the other layers as a black box. In other words, each layer does not care how another layer processes the data as long as it sends the right data in a correct format.

Fig - N-Tier Architecture

1. THE PRESENTATION LAYER

Also called the client layer, this layer comprises components that are dedicated to presenting the data to the user. For example: Windows/Web Forms and buttons, edit boxes, text boxes, labels, grids, etc.

2. THE BUSINESS RULES LAYER

This layer encapsulates the business rules or the business logic of the application. Having a separate layer for business logic is of great advantage. This is because any changes in business rules can be easily handled in this layer. As long as the interface between the layers remains the same, any changes to the functionality/processing logic in this layer can be made without impacting the others. A lot of client-server applications failed to implement successfully because changing the business logic was a painful process.

3. THE DATA ACCESS LAYER

This layer comprises components that help in accessing the database. If used in the right way, this layer provides a level of abstraction for the database structures. Simply put, changes made to the database, tables, etc. do not affect the rest of the application because of the Data Access layer. The different application layers send the data requests to this layer and receive the response from this layer.

4. THE DATABASE LAYER

This layer comprises the database components such as DB Files, Tables, Views, etc. The actual database could be created using SQL Server, Oracle, flat files, etc. In an n-tier application, the entire application can be implemented in such a way that it is independent of the actual database. For instance, you could change the database location with minimal changes to the Data Access Layer. The rest of the application should remain unaffected. A minimal layering sketch is given below.
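The following minimal Java sketch illustrates this layering. The class and method names (UserDataAccess, UserService, UserSearchScreen) are hypothetical and are not taken from the project code; the point is only that each layer depends on the one below it through a fixed interface, so the presentation layer never touches SQL directly and the database can change without affecting the upper layers.

import java.util.Collections;
import java.util.List;

// Data access layer contract: the only thing upper layers know about storage.
interface UserDataAccess {
    List<String> loadUsersByCity(String city);
}

// Business rules layer: validation and business logic, no SQL and no UI code.
class UserService {
    private final UserDataAccess dao;

    UserService(UserDataAccess dao) { this.dao = dao; }

    List<String> usersInCity(String city) {
        // business rule: reject blank input before touching the database
        if (city == null || city.trim().isEmpty()) return Collections.emptyList();
        return dao.loadUsersByCity(city.trim());
    }
}

// Presentation layer: only talks to the business layer and displays results.
class UserSearchScreen {
    private final UserService service;

    UserSearchScreen(UserService service) { this.service = service; }

    void showResults(String city) {
        service.usersInCity(city).forEach(System.out::println);
    }
}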

UML:

USE CASE DIAGRAM: We have identified 2 actors in these diagrams, the actual Machine Users and the Unix Developers. The Machine User can begin using the system; this represents whichever method the user will use in order to make initial interaction with the system. For example, they may need to turn the system on via a button, simply turn the key in the ignition, or some other method.

[Use case diagram: the USER and SYSTEM actors participate in the LOGIN, REGISTRATION, SELECT, SEARCH, and ACCESS PAGES use cases.]

ACTIVITY DIAGRAM :

These activity diagrams show how the use-cases interact with the system and interface. The User starts by initially interacting with the system. The main page is then rendered by the system and it is displayed by the interface, which the user can view. From here the user can click on a link, scroll or close the system. If they chose to click a link, the system renders the new page and it is displayed by the interface, which brings the user back to viewing.

[Activity diagram: START, then LOGIN / REGISTRATION, then SELECT, SEARCH, ACCESS PAGES, and END.]

SEQUENCE DIAGRAM:

[Sequence diagram between USER and SYSTEM: registration and authentication, storing details, login and validation, display of the search page, search details, linking of the entity records, and display of the details.]

CLASS DIAGRAMS:

ER DIAGRAM:
[ER diagram, schema SCHEMA1, entity ONLINEUSERS with columns: FIRSTNAME VARCHAR2(30 BYTE), LASTNAME VARCHAR2(30 BYTE), PASSWORD VARCHAR2(30 BYTE), EMAILID VARCHAR2(30 BYTE), PHONENO VARCHAR2(10 BYTE), STREET VARCHAR2(30 BYTE), CITY VARCHAR2(30 BYTE)]

DATA DICTIONARY:
TABLE NAME: ONLINEUSERS
FIRSTNAME : VARCHAR2(30)
LASTNAME : VARCHAR2(30)
PASSWORD : VARCHAR2(30)
EMAILID : VARCHAR2(30)
PHONENO : VARCHAR2(10)
STREET : VARCHAR2(30)
CITY : VARCHAR2(30)

DATA FLOW DIAGRAM: A DFD usually comprises four components. These four components can be represented by four simple symbols. These symbols can be explained in detail as follows: external entities (source/destination of data) are represented by squares; processes (input-processing-output) are represented by rectangles with rounded corners; data flows (physical or electronic data) are represented by arrows; and finally, data stores (physical or electronic, like XML files) are represented by open-ended rectangles.

[Data flow diagram: Sender, Multiple source data, Database, Data link in common key, Matching overhead, Overhead eliminated, Receiver.]

Process Specification (Techniques and Algorithms Used):
Efficient techniques using matching tree

Chapter 04:

Screen Shots:

Chapter 05 Technology Description

JAVA

Java was designed to meet all the real-world requirements with its key features, which are explained in the following paragraphs.

SIMPLE AND POWERFUL

Java was designed to be easy for the professional programmer to learn and use efficiently. Java makes itself simple by not having surprising features. Since it exposes the inner working of a machine, the programmer can perform his desired actions without fear. Unlike other programming systems that provide dozens of complicated ways to perform a simple task, Java provides a small number of clear ways to achieve a given task.

SECURE

Today everyone is worried about safety and security. People feel that conducting commerce over the Internet is as safe as printing the credit card number on the front page of a newspaper. The threat of viruses and system hackers also exists. To overcome all these fears, Java has safety and security as its key design principle. Using a Java-compatible browser, anyone can safely download Java applets without the fear of viral infection or malicious intent. Java achieves this protection by confining a Java program to the Java execution environment and by making it inaccessible to other parts of the computer. We can download applets with confidence that no harm will be done and no security will be breached.

PORTABLE

In Java, the same mechanism that gives security also helps in portability. Many types of computers and operating systems are in use throughout the world and are connected to the Internet. For downloading programs through different platforms connected to the Internet, some portable, executable code is needed. Java's answer to these problems is its well-designed architecture.

OBJECT-ORIENTED

Java was not designed to be source-code compatible with any other language. The Java team gave a clean, usable, realistic approach to objects. The object model in Java is simple and easy to extend, while simple types, such as integers, are kept as high-performance non-objects.

DYNAMIC

Java programs carry with them extensive amounts of run-time information that is used to verify and resolve accesses to objects at run time. Using this concept it is possible to dynamically link code. The dynamic property of Java adds strength to the applet environment, in which small fragments of byte code may be dynamically updated on a running system.

NEWLY ADDED FEATURES IN JAVA 2

SWING is a set of user interface components that is entirely implemented in Java; the user can use a look and feel that is either specific to a particular operating system or uniform across operating systems. Collections are a group of objects. Java provides several types of collections, such as linked lists, dynamic arrays, and hash tables, for our use. Collections offer a new way to solve several common programming problems.

Various tools such as javac, java and javadoc have been enhanced. Debugger and profiler interfaces for the JVM are available. Performance improvements have been made in several areas. A JUST-IN-TIME (JIT) compiler is included in the JDK. Digital certificates provide a mechanism to establish the identity of a user, and can be referred to as electronic passports. Various security tools are available that enable the user to create and store cryptographic keys and digital certificates, sign Java Archive (JAR) files, and check the signature of a JAR file.

SWING

Swing components facilitate efficient graphical user interface (GUI) development. These components are a collection of lightweight visual components. Swing components contain a replacement for the heavyweight AWT components as well as complex user interface components such as Trees and Tables. Swing components contain a pluggable look and feel (PL&F). This allows all applications to run with the native look and feel on different platforms. PL&F allows applications to have the same behaviour on various platforms. JFC contains an operating system neutral look and feel. Swing components do not contain peers. Swing components allow mixing AWT heavyweight and Swing lightweight components in an application. The major difference between lightweight and heavyweight components is that lightweight components can have transparent pixels while heavyweight components are always opaque. Lightweight components can be non-rectangular while heavyweight components are always rectangular.

Swing components are JavaBean compliant. This allows components to be used easily in a Bean-aware application building program. The root of the majority of the Swing hierarchy is the JComponent class. This class is an extension of the AWT Container class. Swing components comprise a large percentage of the JFC release. The Swing component toolkit consists of over 250 pure Java classes and 75 interfaces contained in about 10 packages. They are used to build lightweight user interfaces. Swing consists of User Interface (UI) classes and non-User Interface classes. The non-User Interface classes provide services and other operations for the UI classes. Swing offers a number of advantages, which include:
Wide variety of Components
Pluggable Look and Feel
MVC Architecture
Keystroke Handling
Action Objects
Nested Containers
Virtual Desktops
Compound Borders
Customized Dialogues
Standard Dialog Classes
Structured Table and Tree Components
Powerful Text Manipulation
Generic Undo Capabilities
Accessibility Support
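As a small, self-contained illustration (not the project's actual user interface), the following Swing sketch builds a simple search window from lightweight components; it is created on the event dispatch thread, as Swing requires, and the button action is only a placeholder where a real client would call the record linkage service.

import javax.swing.JButton;
import javax.swing.JFrame;
import javax.swing.JLabel;
import javax.swing.JTextField;
import javax.swing.SwingUtilities;
import java.awt.FlowLayout;

// Minimal Swing search window built entirely from lightweight components.
public class SearchWindow {
    public static void main(String[] args) {
        SwingUtilities.invokeLater(() -> {
            JFrame frame = new JFrame("Online Record Linkage - Search");
            frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
            frame.setLayout(new FlowLayout());

            JTextField query = new JTextField(25);
            JButton search = new JButton("Search");
            JLabel status = new JLabel(" ");

            // Placeholder action: a real client would submit the query to the linkage service.
            search.addActionListener(e -> status.setText("Searching for: " + query.getText()));

            frame.add(query);
            frame.add(search);
            frame.add(status);
            frame.pack();
            frame.setVisible(true);
        });
    }
}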

JAVA DATABASE CONNECTIVITY (JDBC)

JDBC AND ODBC IN JAVA: The most popular and widely accepted database connectivity, called Open Database Connectivity (ODBC), is used to access relational databases. It offers the ability to connect to almost all databases on almost all platforms. Java applications can also use ODBC to communicate with a database. Then why do we need JDBC? There are several reasons: the ODBC API was completely written in the C language and it makes extensive use of pointers. Calls from Java to native C code have a number of drawbacks in the security, implementation, robustness and automatic portability of applications. ODBC is hard to learn. It mixes simple and advanced features together, and it has complex options even for simple queries. ODBC drivers must be installed on the client's machine.

Structured Query Language (SQL): SQL (pronounced "sequel") is the programming language that defines and manipulates the database. SQL databases are relational databases; this simply means the data is stored in a set of simple relations. A database can have one or more tables. You can define and manipulate data in a table with SQL commands. You use the data definition language (DDL) commands to create and alter databases and tables. You can update, delete or retrieve data in a table with data manipulation language (DML) commands. DML commands include commands to alter and fetch data. The most commonly used SQL command is the SELECT command, which allows you to retrieve data from the database.

In addition to SQL commands, the Oracle server has a procedural language called PL/SQL. PL/SQL enables the programmer to program SQL statements. It allows you to control the flow of a SQL program, to use variables, and to write error-handling procedures.
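To tie the JDBC and SQL discussion together, here is a hedged sketch that looks up a user in the ONLINEUSERS table described earlier in the data dictionary. The JDBC URL, credentials, and e-mail value are placeholders (assumptions), and the appropriate JDBC driver for the chosen back end must be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Sketch of a JDBC lookup with a parameterized SELECT statement.
public class UserLookup {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:sqlserver://localhost:1433;databaseName=recordlinkage"; // assumed URL
        try (Connection con = DriverManager.getConnection(url, "dbuser", "dbpassword");
             PreparedStatement ps = con.prepareStatement(
                     "SELECT FIRSTNAME, LASTNAME, CITY FROM ONLINEUSERS WHERE EMAILID = ?")) {
            ps.setString(1, "someone@example.com"); // assumed enquiry value
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("FIRSTNAME") + " "
                            + rs.getString("LASTNAME") + ", " + rs.getString("CITY"));
                }
            }
        }
    }
}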

Coding: Search.jsp <%@ page import="java.util.*"%> <html xmlns="http://www.w3.org/1999/xhtml"> <head> <meta http-equiv="content-type" content="text/html; charset=utf-8" /> <title>Online Record Linkage</title> <meta name="keywords" content="" /> <meta name="description" content="" /> <link href="default.css" rel="stylesheet" type="text/css" /> <script type="text/javascript"> <% String name=(String)session.getAttribute("UserName"); System.out.println("Hi"+name); %> </script> </head> <body class="bg1"> <table width="100%" border="0" cellspacing="0" cellpadding="0"> <tr> <td class="bg2">

<table width="760" border="0" align="center" cellpadding="0" cellspacing="0"> <tr> <td> <span class="text1">Online</span><span class="text2">Record</span><span class="text2">Linkage</span><br /><br /> </td> <td> <p><span class="text3">Hi!. <%=name%></span></p> </td> </tr> </table> </td> </tr> <tr> <td class="bg3"> <table width="760" border="0" align="center" cellpadding="0" cellspacing="0"> <tr align="center"> <td><a href="search.jsp" class="link1">Home</a></td> <td><img src="images/img3.jpg" alt="" width="8" height="55" /></td> <td><a href="Services.jsp" class="link1">Services</a></td>

<td><img src="images/img3.jpg" alt="" width="8" height="55" /></td> <!--<td><a href="#" class="link1">Bookmarks</a></td> <td><img src="images/img3.jpg" alt="" width="8" height="55" /></td> <td><a href="#" class="link1">Recent Pages</a></td> <td><img src="images/img3.jpg" alt="" width="8" height="55" /></td>--> <td><a href="About.jsp" class="link1">About Us</a> </td> <td><img src="images/img3.jpg" alt="" width="8" height="55" /></td> </tr> </table> </td> </tr> <tr> <td class="bg4"> <table width="700" border="0" align="left" cellpadding="0" cellspacing="0"> <!-- <tr valign="top"><td><img src="images/index.jpeg" align="left" alt="" width="280" height="445" /></td> --> <tr> <td> <form method="post" action="Validate.jsp">

<div id="wb_Text1" style="position:absolute;left:342px;top:330px;width:74px;height:33px;zindex:0;" align="left"> <font style="font-size:50px" color="#000000" face="Brush Script MT">Search</font></div> <input type="text" style="position:absolute;left:500px;top:320px;width:500px;height:33px;fo nt-family:CourierNew;font-size:20px;z-index:1" name="bin"></input> <input type="submit" value="Search" style="position:absolute;left:570px;top:380px;width:140px;height:40px;fo nt-family:Arial;font-size:17px;z-index:2"></input> </form> <!-- <form name="f4" method="post" action="RankChart.jsp"> <input type="submit" name="Page Rank" value="Page Rank" style="position:absolute;left:800px;top:380px;width:140px;height:40px;fo nt-family:Arial;font-size:17px;z-index:3"></input> --> </form> </td> </tr> </table> </td> </tr> </table> <table> <tr>

<td> <table width="700" border="0" align="center" cellpadding="0" cellspacing="0"> <tr> <td class="text5">&nbsp;</td> <td align="right" class="text5">&nbsp;</td> </tr> </table> </td> </tr> </table> </body> </html>

How to Install Document
1. Open the source code folder.
2. Open the code folder.
3. Copy the EfficientTechniques folder.
4. Paste the EfficientTechniques folder into the Tomcat webapps folder.
5. Run the Tomcat server.
6. Open the Internet Explorer browser and type the path/URL.
7. See the video in the "how to run" video folder.

Chapter 6 Testing

Black Box testing
Black Box testing is also known as functional testing. It is a software testing technique whereby the internal workings of the item being tested are not known by the tester. For example, in a black box test on a software design the tester only knows the inputs and what the expected outcomes should be, not how the program arrives at those outputs. The tester never examines the programming code and does not need any further knowledge of the program other than its specifications.

White box testing
White box testing is also known as glass box, structural, clear box and open box testing. It is a software testing technique whereby explicit knowledge of the internal workings of the item being tested is used to select the test data. Unlike black box testing, white box testing uses specific knowledge of programming code to examine outputs. The test is accurate only if the tester knows what the program is supposed to do. He or she can then see if the

program diverges from its intended goal. White box testing does not account for errors caused by omission, and all visible code must also be readable.

Unit testing
Unit testing is a software development process in which the smallest testable parts of an application, called units, are individually and independently scrutinized for proper operation. Unit testing is often automated but it can also be done manually. Unit testing involves only those characteristics that are vital to the performance of the unit under test. This encourages developers to modify the source code without immediate concerns about how such changes might affect the functioning of other units or the program as a whole. Once all of the units in a program have been found to be working in the most efficient and error-free manner possible, larger components of the program can be evaluated by means of integration testing.

System testing
System testing is the process of performing a variety of tests on a system to explore functionality and to identify problems. System testing is usually required before and after a system is put in place.

A series of systematic procedures are referred to while testing is being performed. These procedures tell the tester how the system should perform and where common mistakes may be found. Testers usually try to "break the system" by entering data that may cause the system to malfunction or return incorrect information. For example, a tester may put in a city in a search engine designed to only accept states, to see how the system will respond to the incorrect input.

Integration testing
Integration testing, also known as integration and testing, is a software development process in which program units are combined and tested

as groups in multiple ways. In this context, a unit is defined as the smallest testable part of an application. Integration testing can expose problems with the interfaces among program components before trouble occurs in real-world program execution. There are two major ways of carrying out an integration test, called the bottom-up method and the top-down method. Bottom-up integration testing begins with unit testing, followed by tests of progressively higher-level combinations of units called modules or builds. In top-down integration testing, the highest-level modules are tested first and progressively lower-level modules are tested after that. Test Case Table Chapter 7

Conclusion: In this paper, we develop efficient techniques to facilitate record linkage decisions in a distributed, online setting. Record linkage is an important issue in heterogeneous database systems where the records representing the same real-world entity type are identified using different identifiers in different databases. In the absence of a common identifier, it is often difficult to find records in a remote database that are similar to a local enquiry record. Traditional record linkage uses a probability-based model to identify the closeness between records. The matching probability is computed based on common attribute values. This, of course, requires that common attribute values of all the remote records be transferred to the local site. The communication overhead is significantly large for such an operation. We propose techniques for record linkage that draw upon previous work in sequential decision making. More specifically, we develop a matching tree for attribute acquisition and propose three different schemes of using this tree for record linkage.

Limitation and Future Enhancements: Schema integration is not the focus of this paper

References:
[1] J.A. Baldwin, "Linked Record Health Data Systems," The Statistician, vol. 21, no. 4, pp. 325-338, 1972.
[2] C. Batini, M. Lenzerini, and S.B. Navathe, "A Comparative Analysis of Methodologies for Database Schema Integration," ACM Computing Surveys, vol. 18, no. 4, pp. 323-364, 1986.
[3] R. Baxter, P. Christen, and T. Churches, "A Comparison of Fast Blocking Methods for Record Linkage," Proc. ACM Workshop Data Cleaning, Record Linkage and Object Consolidation, pp. 25-27, Aug. 2003.
[4] T.R. Belin and D.B. Rubin, "A Method for Calibrating False-Match Rates in Record Linkage," J. Am. Statistical Assoc., vol. 90, no. 430, pp. 694-707, 1995.
[5] P. Bernstein, "Applying Model Management to Classical Meta Data Problems," Proc. Conf. Innovative Database Research (CIDR), pp. 209-220, Jan. 2003.
[6] J. Bischoff and T. Alexander, Data Warehouse: Practical Advice from the Experts. Prentice-Hall, 1997.
[7] J.B. Copas and F.J. Hilton, "Record Linkage: Statistical Models for Matching Computer Records," J. Royal Statistical Soc., vol. 153, no. 3, pp. 287-320, 1990.
[8] D. Dey, "Record Matching in Data Warehouses: A Decision Model for Data Consolidation," Operations Research, vol. 51, no. 2, pp. 240-254, 2003.
[9] D. Dey, S. Sarkar, and P. De, "A Distance-Based Approach to Entity Reconciliation in Heterogeneous Databases," IEEE Trans. Knowledge and Data Eng., vol. 14, no. 3, pp. 567-582, May/June 2002.
[10] A.K. Elmagarmid, P.G. Ipeirotis, and V.S. Verykios, "Duplicate Record Detection: A Survey," IEEE Trans. Knowledge and Data Eng., vol. 19, no. 1, pp. 1-16, Jan. 2007.

[11] R.J. Miller, Y.E. Ioannidis, and R. Ramakrishnan, "Schema Equivalence in Heterogeneous Systems: Bridging Theory and Practice," Information Systems, vol. 19, no. 1, pp. 3-31, 1994.
[12] B. Tepping, "A Model for Optimum Linkage of Records," J. Am. Statistical Assoc., vol. 63, pp. 1321-1332, 1968.
