
Ab Initio and Key-Based Operations

Remediator (IT Consultant) posted 10/22/2005 | Comments (1)

Key-based operation - Assign Keys - for surrogate key generation and long-term maintenance. Most folks approach the Assign Keys component as a one-time necessary evil, but it's really meant for long-term maintenance of the relationships between natural keys and their assigned surrogates. So the real trick in using Assign Keys is to operationally maintain a master key file for each natural/surrogate pair. For example, if you have five downstream target tables, you'll need five master key files. Each file is used as input either to assign new surrogate keys or to mate existing natural keys with their original surrogates. Keep in mind that the highest risk in manufacturing surrogate keys is losing track of the relationships and needing a complete rebuild. The surrogates tend to be leveraged by other tables (ideally), so losing the cross-reference between natural and surrogate - due to a system or database crash, for example - can be devastating. Assign Keys coupled with a master key file strategy can be a powerful asset in developing high-performance data models.

Some people (some of whom I've worked with and for) say that surrogate keys are demons sent from hell to torment us. Those people stayed away from surrogates almost like a religion - but it's a religion based on fear and ignorance. One person told me "I've never, ever used surrogate keys," which sounded a lot like someone saying "I've never, ever driven a car," or perhaps "I've never, ever been in a swimming pool." In short: don't brag about sidestepping something that everyone else is doing successfully - it makes you look like an amateur. Look at it like this - an integer key (used as an index) is in many cases over 100 times faster for both lookups and multidimensional queries than its natural (usually character-based) equivalent. Put it to the test: build a table of 100,000 rows with an integer and a string version of the same columnar data (call them intKey and strKey, for example). Use them as foreign keys in another table containing 10,000 rows. Now perform some joins, summaries and order-by operations on the tables using one or the other key. The difference in performance is dramatic even for tables this small. Surrogate keys provide a functional role (they make your data models cleaner and more elegant) as well as a performance boost.
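If you want to run that test without standing up a full warehouse, here is a rough sketch in Python using SQLite. The intKey/strKey column names come from the example above; everything else (table names, row generation, the SQLite engine itself) is an assumption for illustration, and the exact speed ratio you see will depend on the engine and indexing involved:

import random
import sqlite3
import time

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE dim (intKey INTEGER PRIMARY KEY, strKey TEXT, payload TEXT)")
cur.execute("CREATE TABLE fact (id INTEGER PRIMARY KEY, intKey INTEGER, strKey TEXT, amount REAL)")

# 100,000 dimension rows: the same key value carried as an integer and as a string.
cur.executemany("INSERT INTO dim VALUES (?,?,?)",
                [(i, f"{i:010d}", f"row-{i}") for i in range(100_000)])

# 10,000 fact rows pointing at random dimension rows through both key flavors.
facts = []
for i in range(10_000):
    k = random.randrange(100_000)
    facts.append((i, k, f"{k:010d}", random.random()))
cur.executemany("INSERT INTO fact VALUES (?,?,?,?)", facts)
cur.execute("CREATE INDEX ix_dim_str ON dim(strKey)")
con.commit()

def timed(label, sql):
    # Run one join/summarize/order-by query and report the elapsed time.
    start = time.perf_counter()
    cur.execute(sql).fetchall()
    print(f"{label}: {time.perf_counter() - start:.4f}s")

timed("join on intKey",
      "SELECT d.payload, SUM(f.amount) FROM fact f "
      "JOIN dim d ON d.intKey = f.intKey GROUP BY d.payload ORDER BY d.payload")
timed("join on strKey",
      "SELECT d.payload, SUM(f.amount) FROM fact f "
      "JOIN dim d ON d.strKey = f.strKey GROUP BY d.payload ORDER BY d.payload")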
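And while we're sketching: the master key file strategy described at the top of this post can be pictured the same way. This is a hypothetical Python sketch, not Ab Initio DML, and the file name, layout and field names are illustrative assumptions rather than anything the Assign Keys component prescribes - the point is only that the natural/surrogate cross-reference is read on every run, reused for known keys, extended for new ones, and written back:

import csv
import os

MASTER_FILE = "customer_key_master.csv"   # one master file per target table

def load_master(path):
    # Read the existing natural-key -> surrogate-key cross-reference, if any.
    if not os.path.exists(path):
        return {}
    with open(path, newline="") as f:
        return {row["natural_key"]: int(row["surrogate_key"])
                for row in csv.DictReader(f)}

def assign_keys(natural_keys, master):
    # Reuse existing surrogates; mint new ones only for unseen natural keys.
    next_key = max(master.values(), default=0) + 1
    for nk in natural_keys:
        if nk not in master:
            master[nk] = next_key
            next_key += 1
    return master

def save_master(path, master):
    # Persist the full cross-reference so the next run can pick it up again.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["natural_key", "surrogate_key"])
        writer.writerows(sorted(master.items(), key=lambda kv: kv[1]))

# Example run: assign or reuse surrogates for today's inbound natural keys.
master = assign_keys(["CUST-001", "CUST-042", "CUST-007"], load_master(MASTER_FILE))
save_master(MASTER_FILE, master)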

A trap exists in partitioning - consider the following output DML for a Reformat:

record
  string(10) strkey;
  integer(4) intkey;
  decimal(10) deckey;
end

and the following input format for the Reformat:

record
  string(1) strDummy;
end

Now consider the following transform for that same Reformat:

out::reformat(in) =
begin
  let integer(4) nseq = next_in_sequence();
  out.intkey :: nseq;
  out.deckey :: nseq;
  out.strkey :: decimal_lpad((decimal(10))nseq, 10);
end;

Put a Generate Records component in front of the Reformat, set its record count to 1000 and its layout to a serial file, and this will effectively output the following sequence:

Record 1: [record strkey "0000000001" intkey 1 deckey "         1"]
Record 2: [record strkey "0000000002" intkey 2 deckey "         2"]
Record 3: [record strkey "0000000003" intkey 3 deckey "         3"]

Now run this Reformat's output into a Replicate, then through three separate Partition by Key components that each feed a Trash component. Make each of the Trash components use a 4-way multifile layout, and make sure each of the Partition by Keys propagates its layout from the Generate Records (serial), not from the Trash. Set one of the Partition by Keys to partition on strkey, one on intkey and one on deckey. Run the graph, and when it completes, open the tracking detail for each of the three Trash components by right-clicking the Trash and selecting Tracking Detail. You will get something like the following record distribution for partitions 0-3 - your actual mileage may vary:

strkey: 246, 254, 247, 253
intkey: 238, 266, 233, 263
deckey: 274, 227, 273, 226
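The exact counts can't be reproduced outside a graph, since the partition hash lives inside the Co>Operating System, but the effect itself is easy to simulate: the partitioner hashes the bytes the key actually carries, and string(10), integer(4) and decimal(10) serialize the "same" number to different bytes. Here is a hypothetical Python sketch using CRC32 as a stand-in hash (CRC32, the byte layouts and the 4-way split are assumptions for illustration, not Ab Initio's actual mechanics):

import struct
import zlib

PARTITIONS = 4

def partition_of(key_bytes):
    # Stand-in hash partitioner; any byte-level hash shows the same effect.
    return zlib.crc32(key_bytes) % PARTITIONS

for n in (1, 2, 3):
    as_string  = f"{n:010d}".encode()     # string(10): "0000000001"
    as_integer = struct.pack("<i", n)     # integer(4): 4 raw little-endian bytes
    as_decimal = f"{n:>10d}".encode()     # decimal(10): "         1"
    print(n, partition_of(as_string), partition_of(as_integer), partition_of(as_decimal))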

Clearly the records are distributed differently based on their types, even though they superficially represent the same value - and this means downstream partitioned key-based operations WILL GIVE WRONG RESULTS. It primarily affects multiple-branch inputs; in a single input flow the problem is not as dramatic, but it bites when you try to merge separate inbound flows with differing data types. Conclusion: convert all of your data to common graph-facing types before performing a Partition by Key operation. To make things completely transparent to your graph, convert everything to common graph-centric types when the data arrives, so that all of the downstream components behave consistently. Assuming that a decimal(7) value will Join with an integer(4) value already requires some overrides, but even if the two data points are cast correctly in the Join, partitioning them on their original types will mismatch them across partitions (your record may be in partition 0 while its mate is over in partition 1), meaning they will never join and you will get a wrong answer. Likewise with a Rollup or a Sort - you can assume the Rollup or Sort will transparently use the key you have specified, but if the keys are spread across the wrong partitions you still won't get the right answer.
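To make the Join half of that concrete, here is one more hypothetical Python sketch (the data and branch names are made up, and the partition side of the same failure is shown in the previous sketch). Joining the two branches on their raw, differently-typed keys silently matches nothing; normalizing both sides to one common type first restores the match:

# Orders branch carries the key as decimal-style padded text; the client
# master branch carries the same key as an integer.
orders  = {"   1234567": "order-branch-payload"}
clients = {1234567: "client-branch-payload"}

# Join on the raw values: no overlap at all, and no error either - just a
# silently wrong (empty) result.
print(set(orders) & set(clients))                    # -> set()

# Normalize both sides to one common type before joining.
norm_orders  = {int(k): v for k, v in orders.items()}
norm_clients = {int(k): v for k, v in clients.items()}
print(set(norm_orders) & set(norm_clients))          # -> {1234567}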
