Data de-duplication is the process of identifying matching records from a variety of data sets and then merging these together to leave one best-fit record remaining, or a record that takes the best-fit fields – providing the "golden record". It forms the cornerstone of ensuring data quality, especially when you are reviewing your MDM requirements.
Providing a cornerstone to your Master Data Management (MDM) strategy
Understanding Data De-duplication
Transoft White Paper – Confidential & proprietary

Contents
Introduction
Example of de-duplication
Successful de-duplication
    Normalization
    Grouping
    Matching
    Merging
    How Transoft DBIntegrate can help
Transoft – the systems transformation company
Introduction

Data de-duplication is the process of identifying matching records from a variety of data sets and then merging these together to leave one best-fit record remaining, or a record that takes the best-fit fields – providing the "golden record". It forms the cornerstone of ensuring data quality, especially when you are reviewing your MDM requirements.

The process allows a user to match records without a unique common identifier, such as a customer ID number, basing the match instead on key information fields such as surnames, company names or addresses.

The typical reasons duplicated data is created are:
- Lack of processes, such as not checking historical or archive records to see if they can be re-opened
- Inconsistent standards for formatting and abbreviations, such as using nicknames or substituting words with shorthand text, e.g. "Dr" for "doctor"
- Poor data validation, particularly when it comes to addresses
- Staff taking shortcuts, where it is quicker to set up a new record than to find the original
- System integration requirements that avoid overwriting old data by creating whole new records, for example when a worker switches from a PAYE to a LTD pay scheme
- Poor training, where a user can't properly search existing databases
Example of de-duplication

The table below shows example data that requires de-duplication, where long and short versions of a person's name have been used and inconsistencies are apparent within the address data.
ID      Title  Forename      Surname  Addr1              Addr2    Town     Postcode  Pay status
210276  Mrs    Gillian       Rhodes   10 Rogers Lane     Langley  Slough   SL1 4GH   PAYE
102356  Ms     Gill          Rhodes   10 Rogers Lane              Slough   SL1 4HH   LTD
103556  Miss   Gillian Mary  Rhodes   1 Rogers Lane               Slough             LTD
103450  Mr     Matt          Turner   79 Stapleton Rd             Reading  RG1 MX7   PAYE
204576  Mr     Matthew       Tuner    79 Stapleton Road           Reading  RG1       PAYE
A common use for data de-duplication is the identification of replicated postal and email addresses to remove redundancy from mail shots. For example, a marketing team may have two lists of email addresses from two data sources, one which they have collated themselves and one bought from another organization – a situation with very high potential for duplicates. As company perceptions and customer satisfaction can be damaged by small issues such as a customer receiving the same email twice, it is important that the data is correct and valid. From a physical mail perspective, sending two letters or even two catalogues to the same address can have significant cost implications. From a process perspective, a marketing team would ideally be able to import the lists and run a de-duplication project so they are left with one de-duplicated list of addresses. This requires a technical user to set up the project and validate that it finds matches without erroneously pairing data. Once this has been done, the project can be rerun multiple times with different data sets, without any further input from the marketing team.
Successful de-duplication

Some tips to ensure successful de-duplication:
1. Identify the business rules surrounding your data and decide how you want it formatted, e.g. should all customer surnames be in uppercase?
2. Analyse data prior to performing de-duplication to understand the key relationships to be used to match records, e.g. customer activity tables or bank account details.
3. Understand which data should be ranked when matching a set of records, and ask questions when weighing data importance: does it matter, for example, if one record contains a customer's middle name whilst another does not?
4. Access reference data such as the Royal Mail PAF database for validating UK addresses. Other reference examples are deceased, gone-away or do-not-contact lists.
5. Carry out preliminary analysis on a representative data sample to help determine the business rules.

There is no silver bullet for successful de-duplication; it is a case of identifying the situation and rules relevant to each set of data to help build a new master record. However, using the correct software tool and process for defining de-duplication projects is the easiest way to break the work into the following manageable steps.
Normalization

Normalization reduces small data entry inconsistencies and puts data into a consistent format, which greatly enhances the chance of successful and useful de-duplication. This stage can be made up of a series of mapping jobs that bring data from a variety of sources together; for example, Microsoft Excel spreadsheets can be de-duplicated alongside a SQL Server RDBMS and a legacy CRM application. Data should also be converted to consistent data types; for example, converting the string "01/01/2011" to a date/timestamp type benefits de-duplication. Other common normalization practices include:
- Setting title case
- Removing unessential punctuation
- Expanding acronyms and abbreviations, e.g. replacing "St" with "Street" or "Rd" with "Road"

If the data contains a wide range of issues that need to be assessed and resolved, such as address validation or removal of invalid data, it is appropriate to cleanse the data before performing de-duplication. Without cleansed data, the efficiency of matching is reduced and erroneous records may be paired together in the later stages. Normalization should generally only be used to ensure that data within separate records can be matched, not to fix large-scale dirty data issues.
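As a minimal sketch of this stage (the abbreviation map and the exact rules are illustrative examples, not any specific product's behaviour), a normalization routine covering the practices above might look like:

```python
import re

# Hypothetical abbreviation map -- extend according to your business rules.
ABBREVIATIONS = {"st": "street", "rd": "road", "dr": "doctor"}

def normalize(value: str) -> str:
    """Normalize a free-text field prior to de-duplication."""
    value = value.strip().lower()
    value = re.sub(r"[^\w\s]", "", value)              # remove unessential punctuation
    words = [ABBREVIATIONS.get(w, w) for w in value.split()]  # expand abbreviations
    return " ".join(words).title()                     # set title case

print(normalize("79 Stapleton Rd."))   # same form as "79 Stapleton Road"
```

Running the same routine over every source before grouping and matching ensures that "Rd" and "Road", or "Dr" and "doctor", no longer prevent a match.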
Grouping

Running de-duplication on large data sets can take considerable time, so grouping is essential to divide the rows to be matched and merged into manageable sets. In general, the smaller the groups, the more efficient the matching and merging process. Grouping also helps avoid spurious matches between sets; for example, grouping by gender avoids matches between male and female records where first names are unisex, such as Alex, Robin and Sam. Grouping is commonly an optional step: users can simply bypass it, but doing so is only recommended for small data sets (fewer than 10,000 rows) to avoid slowing down project runtimes.
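To illustrate the idea, the hypothetical sketch below groups records by the outward part of their postcode (the choice of grouping key is an assumption for this example), so the matching stage only ever compares records within the same area:

```python
from itertools import groupby

# Hypothetical records drawn from the example table earlier in the paper.
records = [
    {"id": 210276, "surname": "Rhodes", "postcode": "SL1 4GH"},
    {"id": 102356, "surname": "Rhodes", "postcode": "SL1 4HH"},
    {"id": 103450, "surname": "Turner", "postcode": "RG1 MX7"},
]

def group_key(rec: dict) -> str:
    # Outward part of the postcode, e.g. "SL1 4GH" -> "SL1".
    return rec["postcode"].split()[0]

# itertools.groupby requires the input to be sorted by the same key.
records.sort(key=group_key)
groups = {key: list(grp) for key, grp in groupby(records, key=group_key)}
# Matching then runs inside each group, never across groups.
```

Both Rhodes records land in the "SL1" group and are compared to each other, while the Reading record is never compared against them at all, cutting the number of pairwise comparisons.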
Matching

In the matching stage, rules are defined to build a match score for any pair of records, determining whether or not they should be considered duplicates of each other. It is common for fuzzy matching logic to be applied to identify duplicates; these functions help overlook poor data entry, such as spelling mistakes. Looking at our example of email addresses again, a common mistake is for the @ symbol to be mistyped as an apostrophe, tilde or another key surrounding @ on a traditional QWERTY keyboard. If this has not been cleaned up in the normalization phase, matching functions can overlook these issues. Some common and easily transferable matching methodologies are detailed below:

Function      Description
Ignore Order  Matches are made ignoring the order of words within a string, e.g. "John Smith" matches "Smith John".
Soundex       Matches are made on how alike records sound, e.g. "New York" matches "New Yerk".
Hamming       Counts the number of steps required to change one string into another (strictly an edit, or Levenshtein, distance, since Hamming distance applies only to equal-length strings); the user defines the maximum distance allowed before a pair fails as a match. With a maximum distance of 3, "Matt" matches "Matthew".
Substring     Matches are made on elements within a string, e.g. matching on the first part of a postcode, so "SL2 3TY" could match "SL2 4PU" when targeting mail shots at a specific area.
Any data record has a wide range of variations that need to be accounted for, so a variety of fields must be included in matching. For example, where a father and son are named after each other, age or date-of-birth data is required to prevent the pair matching. User discretion is therefore required when defining match rules, and periodic reviews of the matches are suggested to prevent false positives from being accepted as genuine matches.
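The distance-based and order-ignoring methodologies above can be sketched as follows; the function names and the 3-edit threshold are illustrative choices for this paper's examples, not any specific product's API:

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b (Levenshtein)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def is_match(a: str, b: str, max_distance: int = 3) -> bool:
    # Ignore word order first, then allow up to max_distance edits.
    a_norm = " ".join(sorted(a.lower().split()))
    b_norm = " ".join(sorted(b.lower().split()))
    return edit_distance(a_norm, b_norm) <= max_distance

print(is_match("Matt", "Matthew"))           # matches within 3 edits
print(is_match("John Smith", "Smith John"))  # matches with order ignored
```

In practice several such scores would be combined, and the threshold tuned against a reviewed sample of known duplicates, to balance missed duplicates against false positives.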
Merging

The final stage of de-duplication is to merge records together, either by choosing the best-fit fields from the matched source records or by selecting the best-fit record outright. This produces the "golden record" in the target data source. A common approach is to give certain fields or text a higher preference in the merge score by looking for better data in the matching records, for example:
- No null or empty fields
- More recently accessed records, judged by timestamps
- Fields with more complete information, such as a longer address line
- Fields with higher numeric values, for example where gift aid donations amount to more in one record than in another
- Data with more links to other fields in the database, for example where one record contains an external reference number whilst another does not
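A minimal best-fit merge along these lines might be sketched as follows; it applies only the first and third preferences above (non-empty, then most complete, taking length as a simple proxy for completeness), while a real merge would also weigh timestamps, numeric values and linked data:

```python
def merge(records: list[dict]) -> dict:
    """Build a 'golden record' by taking the best-fit value per field."""
    golden = {}
    fields = {field for rec in records for field in rec}
    for field in fields:
        # Prefer non-empty values; of those, keep the most complete (longest).
        candidates = [rec.get(field) for rec in records if rec.get(field)]
        golden[field] = max(candidates, key=lambda v: len(str(v)), default=None)
    return golden

# Two of the matched "Rhodes" records from the example table.
a = {"forename": "Gill", "addr1": "10 Rogers Lane", "postcode": ""}
b = {"forename": "Gillian", "addr1": "10 Rogers Lane", "postcode": "SL1 4GH"}
print(merge([a, b]))
```

The golden record takes the fuller forename "Gillian" and the only non-empty postcode, so no single source record needs to be complete on its own.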
How Transoft DBIntegrate can help Transoft has an experienced consultancy team available to carry out professional data cleansing and de-duplication services. Our team has run hundreds of projects across many different industries, so we can guide, advise or run your data quality projects depending on your requirements. We use Transoft DBIntegrate to make de-duplication a seamless and repeatable project, and offer a wide range of support to our customers. DBIntegrate has a unique user interface to make each stage of the de-duplication process as clear as possible using drop-down boxes, icons and tick boxes so that minimal training is needed. Transoft DBIntegrate also provides real-time access to multiple data sources, enabling flexible, repeatable and automated data migration and cleansing. It offers data integration with optimized real-time read/write access across all data sources, and fast, configurable data warehousing. For more information please visit: www.transoft.com.
Transoft – the systems transformation company

Transoft is a leading provider of innovative and pioneering transformation solutions, with hundreds of thousands of organizations worldwide using our products and services. Our aim is to enable our customers to increase business value and maintain competitive advantage by maximizing the potential of existing data and applications.

This provides rapid return on investment, reduced costs, improved productivity and efficiency, and the ability to manage operational risk. With 25 years' experience, and expert staff dedicated to servicing the needs of organizations with legacy systems, we pride ourselves on a tailored approach to customer service.

Major organizations such as The Gap, L'Oreal, Boeing, Christie's and Balfour Beatty have enjoyed the business benefits of a Transoft application transformation strategy. We work with a large network of VARs, system integrators, ISVs and technical partners to offer unparalleled solutions.

Transoft is a trading name of Transoft Group Limited and Transoft Inc., which are part of the CSH Group of companies. Transoft is a trademark. © Transoft Group Limited and Transoft Inc. 2012. All rights reserved.