ETL Strategy to Store Data Validation Rules

Every time data moves, the results have to be tested against the expected results. For
every ETL process, test conditions for validating data are defined before or during the
design and development phase; any that are missed can be added later.
These test conditions are used to validate data as the ETL process is migrated from DEV
to QA to PRD. They may exist only in the developers' or testers' minds, or be documented
in Word or Excel files. With time the test conditions are lost, ignored, or scattered too
widely to be really useful.
An ETL process that runs in production without error is a good thing, but by itself it does
not mean much: you still need rules to validate the data the process loaded. At that point
you need the data validation rules all over again!
A better ETL strategy is to store the ETL business rules in a RULES table, keyed by target
table and source system. The rules themselves can be stored as SQL text. This creates a
repository of all the rules in a single location that any ETL process or auditor can call at
any phase of the project life cycle.
There is also no need to rewrite or rethink rules. Any or all of the rules can be made
optional, tolerances can be defined, and rules can be run immediately after the process
finishes or used to audit the data at leisure.
This data validation/auditing system basically contains:
A table that contains the rules,
A process to call them dynamically, and
A table to store the results from executing the rules.
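A minimal sketch of such a system, written here in Python against SQLite with hypothetical table and column names (rules, rule_results, rule_sql, tolerance), might look like this:

```python
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")

# Hypothetical repository tables: one for the rules, one for their results.
conn.executescript("""
CREATE TABLE rules (
    rule_id       INTEGER PRIMARY KEY,
    source_system TEXT,
    target_table  TEXT,
    rule_sql      TEXT,     -- a query returning the count of violations
    tolerance     INTEGER,  -- violations allowed before the rule fails
    is_active     INTEGER DEFAULT 1
);
CREATE TABLE rule_results (
    rule_id    INTEGER,
    run_at     TEXT,
    violations INTEGER,
    passed     INTEGER
);
CREATE TABLE customer_dim (customer_id INTEGER, state TEXT);
""")

# Example rule: customer_id must never be NULL in the target table.
conn.execute(
    "INSERT INTO rules (source_system, target_table, rule_sql, tolerance) "
    "VALUES (?, ?, ?, ?)",
    ("CRM", "customer_dim",
     "SELECT COUNT(*) FROM customer_dim WHERE customer_id IS NULL", 0))
conn.execute("INSERT INTO customer_dim VALUES (NULL, 'CA')")

# The driver: fetch the active rules and execute each one dynamically.
for rule_id, rule_sql, tolerance in conn.execute(
        "SELECT rule_id, rule_sql, tolerance FROM rules "
        "WHERE is_active = 1").fetchall():
    violations = conn.execute(rule_sql).fetchone()[0]
    passed = int(violations <= tolerance)
    conn.execute("INSERT INTO rule_results VALUES (?, ?, ?, ?)",
                 (rule_id, datetime.now().isoformat(), violations, passed))
    print(f"rule {rule_id}: {violations} violations -> "
          f"{'PASS' if passed else 'FAIL'}")
```

Because the rules live in data rather than code, adding a rule or loosening a tolerance is just an INSERT or UPDATE against the rules table, with no change to the ETL code.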
Benefits:
Rules can be added dynamically with no change to code.
Rules are stored permanently.
Tolerance levels can be changed without ever changing the code.
Business rules can be added or validated by business experts without worrying about the
ETL code.
Data Warehouse Testing
Businesses are increasingly focusing on the collection and organization of data for strategic
decision-making. The ability to review historical trends and monitor near real-time
operational data has become a key competitive advantage.
We provide practical recommendations for testing extract, transform and load (ETL)
applications based on years of experience testing data warehouses in the financial services
and consumer retailing areas.
There is an exponentially increasing cost associated with finding software defects later in
the development lifecycle. In data warehousing, this is compounded because of the
additional business costs of using incorrect data to make critical business decisions. Given
the importance of early detection of software defects, here are some general goals of testing
an ETL application:
Data completeness. Ensures that all expected data is loaded.

Data transformation. Ensures that all data is transformed correctly according to
business rules and/or design specifications.

Data quality. Ensures that the ETL application correctly rejects, substitutes default
values, corrects or ignores, and reports invalid data.

Performance and scalability. Ensures that data loads and queries perform within
expected time frames and that the technical architecture is scalable.

Integration testing. Ensures that the ETL process functions well with other upstream
and downstream processes.

User-acceptance testing. Ensures the solution meets users' current expectations and
anticipates their future expectations.

Regression testing. Ensures existing functionality remains intact each time a new
release of code is completed.

Data Completeness
One of the most basic tests of data completeness is to verify that all expected data loads into
the data warehouse. This includes validating that all records, all fields and the full contents
of each field are loaded. Strategies to consider include:
Comparing record counts between source data, data loaded to the warehouse and
rejected records.

Comparing unique values of key fields between source data and data loaded to the
warehouse. This is a valuable technique that points out a variety of possible data errors
without doing a full validation on all fields.

Utilizing a data profiling tool that shows the range and value distributions of fields in
a data set. This can be used during testing and in production to compare source and target
data sets and point out any data anomalies from source systems that may be missed even
when the data movement is correct.

Populating the full contents of each field to validate that no truncation occurs at any
step in the process. For example, if the source data field is a string(30), make sure to test it
with 30 characters.

Testing the boundaries of each field to find any database limitations. For example,
for a decimal(3) field include values of -99 and 999, and for date fields include the entire
range of dates expected. Depending on the type of database and how it is indexed, it is
possible that the range of values the database accepts is too small.
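As a rough illustration, the first, second, and fourth checks above might be scripted like this in Python, assuming the source is a delimited file and that the loaded and rejected rows have already been queried from the warehouse into lists of dictionaries (all names here are hypothetical):

```python
import csv

def completeness_checks(source_file, loaded_rows, rejected_rows, key_field):
    """loaded_rows / rejected_rows: lists of dicts pulled from the warehouse."""
    with open(source_file, newline="") as f:
        source_rows = list(csv.DictReader(f))

    # 1. Record counts must reconcile: source = loaded + rejected.
    if len(source_rows) != len(loaded_rows) + len(rejected_rows):
        print("FAIL: record counts do not reconcile")

    # 2. Every unique key value in the source must be accounted for.
    accounted = ({r[key_field] for r in loaded_rows}
                 | {r[key_field] for r in rejected_rows})
    missing = {r[key_field] for r in source_rows} - accounted
    if missing:
        print(f"FAIL: keys lost in flight: {sorted(missing)[:10]}")

    # 3. Truncation heuristic: the longest loaded value per field should be
    #    no shorter than the longest source value (a full-width test value,
    #    e.g. 30 characters for a string(30), makes this check meaningful).
    for field in source_rows[0]:
        src_max = max(len(r[field]) for r in source_rows)
        tgt_max = max((len(str(r.get(field, ""))) for r in loaded_rows),
                      default=0)
        if tgt_max < src_max:
            print(f"WARN: {field} may be truncated ({tgt_max} < {src_max} chars)")
```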
Data Transformation
Validating that data is transformed correctly based on business rules can be the most
complex part of testing an ETL application with significant transformation logic. One typical
method is to pick some sample records and "stare and compare" to validate the data
transformations manually. This can be useful, but it requires manual testing steps and testers
who understand the ETL logic. A combination of automated data profiling and automated
data movement validations is a better long-term strategy. Here are some simple automated
data movement techniques:

Create a spreadsheet of scenarios of input data and expected results and validate
these with the business customer. This is a good requirements elicitation exercise during
design and can also be used during testing.

Create test data that includes all scenarios. Elicit the help of an ETL developer to
automate the process of populating data sets with the scenario spreadsheet to allow for
flexibility because scenarios will change.

Utilize data profiling results to compare range and distribution of values in each field
between source and target data.

Validate correct processing of ETL-generated fields such as surrogate keys.

Validate that data types in the warehouse are as specified in the design and/or the
data model.

Set up data scenarios that test referential integrity between tables. For example, what
happens when the data contains foreign key values not in the parent table?

Validate parent-to-child relationships in the data. Set up data scenarios that test how
orphaned child records are handled.
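For example, the referential integrity and orphan scenarios in the last two items can be verified with simple SQL. A self-contained sketch using SQLite, with made-up table and column names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_code TEXT);
CREATE TABLE sales_fact  (sale_id INTEGER, product_key INTEGER, amount REAL);
INSERT INTO product_dim VALUES (1, 'A100');
INSERT INTO sales_fact  VALUES (10, 1, 25.0), (11, 99, 13.5);  -- 99 has no parent
""")

# Orphaned children: fact rows whose foreign key has no parent in the dimension.
orphans = conn.execute("""
    SELECT f.sale_id, f.product_key
    FROM sales_fact f
    LEFT JOIN product_dim d ON d.product_key = f.product_key
    WHERE d.product_key IS NULL
""").fetchall()
print("orphaned fact rows:", orphans)  # expect [(11, 99)]

# ETL-generated surrogate keys must be unique.
dupes = conn.execute("""
    SELECT product_key, COUNT(*) FROM product_dim
    GROUP BY product_key HAVING COUNT(*) > 1
""").fetchall()
print("duplicate surrogate keys:", dupes)  # expect []
```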

Data Quality
For the purposes of this discussion, data quality is defined as how the ETL system handles
data rejection, substitution, correction and notification without modifying data. To ensure
success in testing data quality, include as many data scenarios as possible. Typically, data
quality rules are defined during design, for example:
Reject the record if a certain decimal field has nonnumeric data.

Substitute null if a certain decimal field has nonnumeric data.

Validate and correct the state field if necessary based on the ZIP code.

Compare product code to values in a lookup table, and if there is no match, load the
record anyway but report it to users (see the sketch at the end of this section).
Depending on the data quality rules of the application being tested, scenarios to test might
include null key values, duplicate records in source data and invalid data types in fields (e.g.,
alphabetic characters in a decimal field). Review the detailed test scenarios with business
users and technical designers to ensure that all are on the same page. Data quality rules
applied to the data will usually be invisible to the users once the application is in
production; users will only see what's loaded to the database. For this reason, it is important
to ensure that what is done with invalid data is reported to the users. These data quality
reports present valuable data that sometimes reveals systematic issues with source data. In
some cases, it may be beneficial to populate the "before" data in the database for users to
view.
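To make the four rule styles concrete, here is a sketch of a per-record quality handler; the field names, lookup data, and rejection policy are all hypothetical:

```python
ZIP_TO_STATE = {"10001": "NY", "94105": "CA"}  # assumed lookup data
KNOWN_PRODUCTS = {"A100", "B200"}

def is_decimal(value):
    # naive nonnegative-decimal check, good enough for a sketch
    return value is not None and value.replace(".", "", 1).isdigit()

def apply_quality_rules(record, report):
    # Reject: nonnumeric data in a mandatory decimal field kills the record.
    if not is_decimal(record["amount"]):
        report.append(("REJECTED", dict(record)))
        return None
    # Substitute: a nonnumeric optional decimal becomes NULL instead of failing.
    if not is_decimal(record["discount"]):
        record["discount"] = None
    # Correct: fix the state from the ZIP code when they disagree.
    expected = ZIP_TO_STATE.get(record["zip"])
    if expected and record["state"] != expected:
        report.append(("CORRECTED state", dict(record)))
        record["state"] = expected
    # Load anyway but report: unknown product codes still load.
    if record["product_code"] not in KNOWN_PRODUCTS:
        report.append(("UNKNOWN product_code", dict(record)))
    return record

report = []
row = {"amount": "19.99", "discount": "n/a", "zip": "10001",
       "state": "NJ", "product_code": "C300"}
print(apply_quality_rules(row, report))
for entry in report:
    print(entry)
```

The report list is what would feed the data quality reports described above, so users can see what was done with invalid data.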
Performance and Scalability
As the volume of data in a data warehouse grows, ETL load times can be expected to
increase, and performance of queries can be expected to degrade. This can be mitigated by
having a solid technical architecture and good ETL design. The aim of performance
testing is to point out any potential weaknesses in the ETL design, such as reading a file
multiple times or creating unnecessary intermediate files. The following strategies will help
discover performance issues:
Load the database with peak expected production volumes to ensure that this volume
of data can be loaded by the ETL process within the agreed-upon window.

Compare these ETL loading times to loads performed with a smaller amount of data
to anticipate scalability issues. Compare the ETL processing times component by
component to point out any areas of weakness.

Monitor the timing of the reject process and consider how large volumes of rejected
data will be handled.

Perform simple and multiple join queries to validate query performance on large
database volumes. Work with business users to develop sample queries and acceptable
performance criteria for each query.
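A rough sketch of the component-by-component timing comparison; the extract/transform/load functions below are stand-ins for the real pipeline steps:

```python
import time

# Stand-in ETL components; in practice each would invoke a real pipeline step.
def extract(n):   return list(range(n))
def transform(n): return [i * 2 for i in range(n)]
def load(n):      return sum(range(n))

def timed(fn, volume):
    start = time.perf_counter()
    fn(volume)
    return time.perf_counter() - start

def run_etl(volume):
    return {name: timed(fn, volume)
            for name, fn in [("extract", extract),
                             ("transform", transform),
                             ("load", load)]}

small, peak = run_etl(100_000), run_etl(1_000_000)
for step in small:
    ratio = peak[step] / max(small[step], 1e-9)
    # With 10x the data, a ratio far above 10 suggests nonlinear behavior in
    # that component, e.g. a file read repeatedly or an unindexed lookup.
    print(f"{step}: {ratio:.1f}x slower at peak volume")
```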
Integration Testing
Typically, system testing only includes testing within the ETL application. The endpoints for
system testing are the input and output of the ETL code being tested. Integration testing
shows how the application fits into the overall flow of all upstream and downstream
applications. When creating integration test scenarios, consider how the overall process can
break and focus on touch points between applications rather than within one application.
Consider how process failures at each step would be handled and how data would be
recovered or deleted if necessary.
Most issues found during integration testing are either data-related or result from false
assumptions about the design of another application. Therefore, it is important to
integration test with production-like data. Real production data is ideal, but depending on
the contents of the data, there could be privacy or security concerns that require certain
fields to be randomized before using it in a test environment. As always, don't forget the
importance of good communication between the testing and design teams of all systems
involved. To help bridge this communication gap, gather team members from all systems
together to formulate test scenarios and discuss what could go wrong in production. Run the
overall process from end to end in the same order and with the same dependencies as in
production. Integration testing should be a combined effort and not the responsibility solely
of the team testing the ETL application
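One way to exercise the overall flow is a small driver that runs each application in the same dependency order as production and skips downstream steps when an upstream one fails; the step names here are hypothetical:

```python
# Each step runs only after all of its dependencies have succeeded.
STEPS = {
    "upstream_extract":  [],
    "etl_load":          ["upstream_extract"],
    "downstream_report": ["etl_load"],
}

def run_step(name):
    print(f"running {name}")
    return True  # replace with the real invocation; False signals failure

def integration_run():
    done, failed, pending = set(), set(), dict(STEPS)
    while pending:
        ready   = [s for s, deps in pending.items()
                   if all(d in done for d in deps)
                   and not any(d in failed for d in deps)]
        blocked = [s for s, deps in pending.items()
                   if any(d in failed for d in deps)]
        for s in blocked:
            print(f"skipping {s}: upstream failure")
            failed.add(s)
            pending.pop(s)
        if not ready and not blocked:
            break  # nothing runnable (e.g. a dependency cycle)
        for s in ready:
            (done if run_step(s) else failed).add(s)
            pending.pop(s)

integration_run()
```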
Flat File Validations
When we are extracting flat files, what are the basic required validations?
Ans. Flat File Validations

Following are some common validations performed:

a) Check for blank lines and remove them.
b) Check the number of columns in each row of the file.
c) If there is a trailer line in the flat file containing additional information, such as the total
number of records, then cross-check that the number of records specified in the trailer
matches the actual number of records.
d) Check whether a column contains blank values (if it is expected to have values).
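A minimal Python sketch of checks (a) through (d), assuming a comma-delimited file whose last non-blank line is a trailer ending in the record count (both assumptions would have to match the real file specification):

```python
import csv

def validate_flat_file(path, expected_columns):
    errors = []
    with open(path, newline="") as f:
        lines = list(csv.reader(f))

    # (a) flag blank lines
    data = []
    for i, row in enumerate(lines, 1):
        if not row or all(v.strip() == "" for v in row):
            errors.append(f"line {i}: blank line")
        else:
            data.append((i, row))

    # assumed trailer format: last non-blank line, e.g. TRAILER,<record_count>
    (_, trailer), records = data[-1], data[:-1]

    # (b) each data row must have the expected number of columns
    for i, row in records:
        if len(row) != expected_columns:
            errors.append(f"line {i}: {len(row)} columns, "
                          f"expected {expected_columns}")

    # (c) the trailer count must match the actual number of records
    if int(trailer[-1]) != len(records):
        errors.append(f"trailer says {trailer[-1]} records, "
                      f"file has {len(records)}")

    # (d) mandatory fields must not be blank (here: the first column)
    for i, row in records:
        if row[0].strip() == "":
            errors.append(f"line {i}: blank value in mandatory column 0")

    return errors
```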
Data Validations
It depends upon the requirements you need to meet.
Some basic checks:
1. NULL validation
2. Data type validation
If you consider data quality, the following points may also come up:
1. Address field validations
2. Word validations
3. Character validations
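For instance, NULL, data type, and character checks can be expressed as small per-field validators; the field names and patterns below are illustrative only:

```python
import re

# Hypothetical per-field validators: each returns True when the value passes.
VALIDATORS = {
    "customer_id": lambda v: v not in (None, ""),               # NULL check
    "amount":      lambda v: re.fullmatch(r"-?\d+(\.\d+)?",
                                          v or "") is not None,  # decimal type
    "state":       lambda v: re.fullmatch(r"[A-Z]{2}",
                                          v or "") is not None,  # characters
}

def validate_record(record):
    """Return the list of fields that fail validation."""
    return [field for field, ok in VALIDATORS.items()
            if not ok(record.get(field))]

print(validate_record({"customer_id": "42", "amount": "12.5",
                       "state": "NY"}))   # []
print(validate_record({"customer_id": "", "amount": "abc",
                       "state": "n/a"}))  # all three fields fail
```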
What is Requirements Traceability?
Requirements traceability is defined as the ability to describe and follow the life of a
requirement, in both a forward and backward direction (i.e., from its origins, through its
development and specification, to its subsequent deployment and use, and through periods
of ongoing refinement and iteration in any of these phases).
Traceability ensures completeness: that all lower-level requirements come from higher-level
requirements, and that all higher-level requirements are allocated to lower-level
requirements.
Traceability is also used in managing change and provides the basis for test
planning.
Benefits:
To identify the extent to which the business requirements have been covered by
functional and system requirements.
To identify orphan functional and system requirements, which would indicate a
missing trace between requirements.
To identify the extent to which system requirements are covered from a design
perspective.
To identify the functional coverage of the QA test scenarios.
To identify which design components implement a requirement.
To identify the test scenarios that will be used to verify a requirement.
To analyze the impact of changing requirements on the software artifacts created in
subsequent phases of the SDLC.
For any given project, three important questions need to be answered before embarking
on any particular requirements traceability approach:
What needs to be traced?
What types of links need to be made?
Who should establish and maintain the links, and how and when?
What needs to be traced:
Application Components
Business Requirements
Functional Requirements
System Requirements
Design Artifacts
Testing Artifacts
Type of Links:
Forward and backward links between requirements
Vertical links between requirements and other artifacts
Internal/External links
Who, How and When?
Project Manager, Business Analyst, Development Lead?
Through tools or through document linking and references
Stages in the SDLC with well-defined entry and exit criteria for defining the links
Link requirements to external documents/URLs to enhance the requirement description.
Link requirements across projects.
Get a full view of how requirements are related to each other through a Matrix view or
Tree view.
Get a full description of the linked requirements at the click of a button.
Prevent unpleasant surprises through real-time alerts when requirements change.
Traces are automatically marked suspect when requirements change.
Get a full description of the change by comparing versions of the requirements.

Generate functional coverage reports to reflect requirements that have not been
addressed in the project.
Generate test coverage reports to identify requirements that have not been taken into
account for testing purposes.
Features:
Automatic conversion of links to suspect when requirements change
Trigger alerts to concerned parties when requirements change
Facility to identify the change in the request at the click of a button
Identify orphan or hanging requirements
Identify implied links
Generate links on the fly
Trace to requirements in external projects
Trace to artifacts in configuration management tools
Trace to artifacts in design tools
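As a toy illustration of these mechanics, forward links, orphan detection, coverage gaps, and suspect marking can be modeled in a few lines of Python (all requirement IDs are made up):

```python
# parent -> children (forward traces); backward traces are the inverse.
LINKS = {
    "BR-1": ["FR-1", "FR-2"],
    "FR-1": ["TC-1"],
    "FR-2": [],               # no test coverage yet
}
ALL_ARTIFACTS = {"BR-1", "FR-1", "FR-2", "TC-1", "FR-9"}  # FR-9 traces to nothing

children = {c for cs in LINKS.values() for c in cs}

# Orphans: artifacts with no trace in either direction.
orphans = ALL_ARTIFACTS - children - set(LINKS)
print("orphan requirements:", orphans)              # {'FR-9'}

# Coverage gaps: requirements with no downstream artifact.
uncovered = {p for p, cs in LINKS.items() if not cs}
print("requirements without coverage:", uncovered)  # {'FR-2'}

# When a requirement changes, every trace touching it becomes suspect.
def mark_suspect(changed):
    return {(p, c) for p, cs in LINKS.items() for c in cs if changed in (p, c)}

print("suspect links after BR-1 change:", mark_suspect("BR-1"))
```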
