Friday, August 19, 2011

Testing Data Quality

Data Quality Defined: Quality data is, simply put, data that meets business demands.


Why Data Quality ?
According to Larry English, the "Father of Data Quality", the business costs of non-quality data, including irrecoverable costs, rework of products or services, workarounds, and lost and missed revenue, may be as high as 10 to 25 percent of an organization's revenue or total budget.

Data Quality Impact:
Organizations with poor data quality spend time working with conflicting reports and flawed business plans, resulting in erroneous decisions made with outdated, inconsistent, and invalid data.

Approach to test data quality:
Data quality testing is an integral part of the data warehouse test plan, but since we have taken it up as a separate topic, let's see how it can be carried out. A data quality framework mainly deals with the following dimensions of data, and tests should be designed to validate each of them.

1. Row count: The record counts at source and target should be the same at a given point in time. Differences can arise from missing records, extra records in the target, or duplicates in the target. This is usually tested by comparing the row counts of source and target for a particular date range.
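As a rough sketch, the row count check can be scripted like this in Python with an in-memory SQLite database standing in for the warehouse. The table and column names (src_orders, tgt_orders, load_date) are hypothetical; a real check would point at the actual source and target connections.

```python
import sqlite3

# Stand-in source and target tables; the target is missing one record.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_orders (id INTEGER, load_date TEXT);
    CREATE TABLE tgt_orders (id INTEGER, load_date TEXT);
    INSERT INTO src_orders VALUES (1,'2011-08-19'),(2,'2011-08-19'),(3,'2011-08-19');
    INSERT INTO tgt_orders VALUES (1,'2011-08-19'),(2,'2011-08-19');
""")

def row_count(table, date):
    # Count rows loaded for one date; the date is parameterized, not inlined.
    sql = f"SELECT COUNT(*) FROM {table} WHERE load_date = ?"
    return conn.execute(sql, (date,)).fetchone()[0]

src = row_count("src_orders", "2011-08-19")
tgt = row_count("tgt_orders", "2011-08-19")
print(f"source={src} target={tgt} match={src == tgt}")
```

A mismatch (here 3 vs. 2) only flags the problem; deciding whether it is missing records, extras, or duplicates needs the follow-up checks below.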

2. Completeness: At any given point in time, the data in the source and target should be the same and should conform to the business rules. This is usually done by comparing data values in the source and target. This dimension is usually skipped in the daily health check; completeness is thoroughly verified during functional and regression test execution.
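One simple way to compare values, sketched below with hypothetical src_cust and tgt_cust tables in SQLite, is to take the set difference of the rows in each direction: rows present in the source but not the target, and vice versa.

```python
import sqlite3

# Stand-in tables: one value is corrupted in the target ('Bpb' vs. 'Bob').
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_cust (id INTEGER, name TEXT);
    CREATE TABLE tgt_cust (id INTEGER, name TEXT);
    INSERT INTO src_cust VALUES (1,'Ann'),(2,'Bob');
    INSERT INTO tgt_cust VALUES (1,'Ann'),(2,'Bpb');
""")

# Cursors iterate as tuples, so each row set is a set of (id, name) pairs.
src_rows = set(conn.execute("SELECT id, name FROM src_cust"))
tgt_rows = set(conn.execute("SELECT id, name FROM tgt_cust"))

missing_in_target = src_rows - tgt_rows      # dropped or altered on the way in
unexpected_in_target = tgt_rows - src_rows   # corrupted or extra rows
print(missing_in_target, unexpected_in_target)
```

For large tables this would be done with a minus/except query or checksums per partition rather than pulling full row sets into memory.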

3. Consistency: Data should remain consistent after any transactional changes, so that viewing the data from any mart yields the same picture. This is tested by querying the same data from different sources; since changes in a DWH system are dynamic, these tests should be performed on every health check run.
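In practice this often means running the same aggregate against two marts and asserting the results agree. A minimal sketch, with hypothetical sales_mart and finance_mart tables:

```python
import sqlite3

# Two stand-in marts that are supposed to carry the same sales figures.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_mart (region TEXT, amount REAL);
    CREATE TABLE finance_mart (region TEXT, amount REAL);
    INSERT INTO sales_mart VALUES ('EU', 100.0), ('US', 250.0);
    INSERT INTO finance_mart VALUES ('EU', 100.0), ('US', 250.0);
""")

# Identical aggregate run against each mart; rounding avoids float noise.
query = "SELECT ROUND(SUM(amount), 2) FROM {}"
sales_total = conn.execute(query.format("sales_mart")).fetchone()[0]
finance_total = conn.execute(query.format("finance_mart")).fetchone()[0]
print(f"sales={sales_total} finance={finance_total} "
      f"consistent={sales_total == finance_total}")
```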

4. Redundancy: Redundant data comes in two forms: physical duplicates and logical duplicates. A physical duplicate is multiple entries of the same data in the same table, whereas a logical duplicate means there are two entries for one single entity. Checks for physical duplicates should be performed with every run of the DWH system health check, and tests for logical duplicates should be performed in the functional and regression rounds.
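Both forms can be caught with GROUP BY / HAVING queries: group on all columns for physical duplicates, and on the business key for logical ones. A sketch against a hypothetical customer_dim table, where email is assumed to be the business key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_dim (cust_id INTEGER, email TEXT, name TEXT);
    INSERT INTO customer_dim VALUES
        (1, 'ann@x.com', 'Ann'),
        (1, 'ann@x.com', 'Ann'),        -- physical duplicate: identical row
        (2, 'bob@x.com', 'Bob'),
        (3, 'bob@x.com', 'Robert');     -- logical duplicate: same person, two ids
""")

# Physical duplicates: the same full row appears more than once.
physical = conn.execute("""
    SELECT cust_id, COUNT(*) FROM customer_dim
    GROUP BY cust_id, email, name HAVING COUNT(*) > 1
""").fetchall()

# Logical duplicates: one business key maps to more than one surrogate id.
logical = conn.execute("""
    SELECT email, COUNT(DISTINCT cust_id) FROM customer_dim
    GROUP BY email HAVING COUNT(DISTINCT cust_id) > 1
""").fetchall()
print(physical, logical)
```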

5. Data Integrity: A change in data should be reflected in all places. Again, testing for this dimension is skipped in the periodic/daily health check run.

6. Accuracy: The degree to which data correctly reflects real-world objects. This is also skipped in the daily/periodic health check, as this dimension should be extensively tested in the functional and regression rounds.

7. Timeliness: Updated data should be available when it is required. This should be tested on every run of the health check.
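A timeliness check usually reduces to comparing the latest load timestamp against the SLA window. A sketch with a hypothetical fact_sales table and an assumed 24-hour SLA:

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (amount REAL, loaded_at TEXT)")

# Simulate a load that finished two hours before the check runs.
now = datetime(2011, 8, 19, 12, 0, 0)
conn.execute("INSERT INTO fact_sales VALUES (9.99, ?)",
             ((now - timedelta(hours=2)).isoformat(),))

# Data is "timely" if the most recent load falls inside the SLA window.
latest = datetime.fromisoformat(
    conn.execute("SELECT MAX(loaded_at) FROM fact_sales").fetchone()[0])
sla = timedelta(hours=24)
is_timely = (now - latest) <= sla
print(f"latest load {latest}, timely={is_timely}")
```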
The data quality test cycle in a data warehouse should be performed as per the SLA. If it is a near-real-time data warehouse, it should be tested every day, or at least the dimensions (DIMS) and facts (FACTS) that constitute the near-real-time portion. Execution is usually automated, but if not, it should be carried out manually.
Note: Test approach depends on the implemented Data Quality Framework.
