Wednesday, August 24, 2011

Data warehouse glossary


For ease of reading, let's divide the topic into two components of a data warehouse and look at the terms commonly used in each. Since many terms cut across both sections, if you feel a term belongs in the other component, or I have missed one, comments are appreciated.


Disclaimer: My sole objective here is to put across most of the jargon used in data warehousing, not to segregate a data warehouse into components.
  • Data warehouse
  • Business Intelligence (reporting)
Some commonly used terminology in Data warehouse:
  • Aggregates: Physical rows in the database, almost always created by summing other records for the purpose of improving query performance. A data warehouse often has several aggregate FACT tables, built by summing the original fact table over selected dimensions. These tables have far fewer rows, so queries against them execute faster.
  • Attribute: A column (field) in a DIM (dimension) table.
  • Conformed dimension: If two dimensions are the same, or one is an exact subset of the other, they are called conformed dimensions. The results produced by joining either of these dimensions to any fact table should be the same.
  • Data extract: The process of copying data from a source system (in this case an operational system) in order to load it into the target system (the data warehouse).
  • Data Mart: A logical and physical subset of the data warehouse that caters to the requirements of a specific area of the business process.
  • Data Mining: Finding patterns in the most atomic (detailed) data.
  • Data presentation area: The place where data is organized, stored, and made available for direct querying or for access from data access and analytics tools.
  • Data staging area: The area where data is kept temporarily after extraction from the source system, and where it is cleaned, transformed, de-duplicated, and (in some cases) archived.
  • Data warehouse: A subject-oriented, integrated, time-variant, and non-volatile collection of data that supports the organization's decision-making process.
  • Degenerate dimension: A dimension key, such as an invoice number, transaction number, or bill number, that has no attributes of its own and hence does not join to an actual dimension table.
  • Dimension table: A table in the data warehouse that holds the descriptive attribute columns for an entity.
  • Metadata: Any data maintained to support the operation or use of the data warehouse.
  • Slowly changing dimensions (SCD): The tendency of dimension rows to change gradually or occasionally over time. A type 1 SCD is a dimension whose attributes are overwritten when the value of an attribute changes. A type 2 SCD is a dimension where a new row is created when the value of an attribute changes. A type 3 SCD is a dimension where an alternate (old-value) column is added when an attribute changes (a minimal type 2 sketch follows this list).
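
As an illustration of the type 2 behaviour described above, the sketch below expires the current dimension row and inserts a new one when a tracked attribute changes. The table and columns (dim_customer, city, eff_date, end_date, current_flag) and the dates are hypothetical, and Python's built-in sqlite3 is used only to keep the example self-contained and runnable.

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_sk   INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        customer_id   TEXT,                               -- natural/business key
        city          TEXT,                               -- tracked attribute
        eff_date      TEXT,
        end_date      TEXT,
        current_flag  INTEGER
    )
""")
conn.execute("INSERT INTO dim_customer (customer_id, city, eff_date, end_date, current_flag) "
             "VALUES ('C001', 'Pune', '2011-01-01', '9999-12-31', 1)")

def scd2_update(conn, customer_id, new_city, change_date):
    """Type 2: expire the current row and add a new row for the changed attribute."""
    row = conn.execute(
        "SELECT customer_sk, city FROM dim_customer "
        "WHERE customer_id = ? AND current_flag = 1", (customer_id,)).fetchone()
    if row and row[1] != new_city:
        conn.execute("UPDATE dim_customer SET end_date = ?, current_flag = 0 "
                     "WHERE customer_sk = ?", (change_date, row[0]))
        conn.execute("INSERT INTO dim_customer (customer_id, city, eff_date, end_date, current_flag) "
                     "VALUES (?, ?, ?, '9999-12-31', 1)", (customer_id, new_city, change_date))
    conn.commit()

scd2_update(conn, "C001", "Mumbai", str(date(2011, 8, 24)))
for r in conn.execute("SELECT * FROM dim_customer ORDER BY customer_sk"):
    print(r)   # two rows: the expired Pune row and the current Mumbai row
```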
Some commonly used terminology in Business Intelligence:
  • Ad hoc queries / reports: Queries or reports that are formulated or designed by the user on the spur of the moment.
  • Analytic processing: Using data for analytic purposes to support business decision making.
  • Business Intelligence: A generic term for leveraging an organization's internal and external information assets to make better business decisions.
  • Cubes: The name of the dimensional structure on an OLAP database platform, originally referring to a simple structure of three dimensions, such as product, time, and place.
  • ODS (Operational Data Store): A physical set of tables sitting between the operational systems and the data warehouse. The ODS caters to the demand for immediate / real-time reporting from dashboards.
  • Slice and dice: The ability to access a data warehouse through any of its dimensions equally. Slicing and dicing is the process of separating and combining warehouse data in seemingly endless combinations (a small example follows this list).
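
To make slice and dice concrete, the sketch below groups the same hypothetical fact table along different dimension combinations; the table and dimension names are assumptions, and sqlite3 is used only so the example runs on its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (product TEXT, region TEXT, month TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
    [("Soap", "East", "2011-07", 120.0),
     ("Soap", "West", "2011-07", 80.0),
     ("Shampoo", "East", "2011-08", 200.0)])

# "Slice": fix one dimension value and look at the rest.
east_only = conn.execute(
    "SELECT product, month, SUM(amount) FROM fact_sales "
    "WHERE region = 'East' GROUP BY product, month").fetchall()

# "Dice": regroup the same data along a different combination of dimensions.
by_region_month = conn.execute(
    "SELECT region, month, SUM(amount) FROM fact_sales "
    "GROUP BY region, month").fetchall()

print(east_only)
print(by_region_month)
```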

Sunday, August 21, 2011

Testing Data warehouse


Data warehousing plays a pivotal role in identifying significant trends and aspects of real-world business, based on the historical and real-time data available in the warehouse. Testing has proven to be a necessity to sustain the data accuracy and credibility of a data warehouse system.
Hardly any standard methodologies have been defined for testing in a data warehouse environment, and testing takes a different course when the system under test is a data warehouse. Whereas the traditional approach emphasizes testing the functionality or business logic of a system, data warehouse testing revolves around testing the data itself and the information derived from it, which is much more tedious.


Data warehouse in a nutshell:
A data warehouse is a repository for an organization, where its electronic data from heterogeneous sources is stored in a single place. The stored data then supports querying and analysis for the purpose of decision making. The data in the warehouse is not simply a collection of data from heterogeneous sources; it is data that conforms to the business rules. To achieve this, data is Extracted (from heterogeneous systems), Transformed (to the business rules), and Loaded (into the data warehouse).
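
A minimal, hand-written sketch of that extract-transform-load flow, assuming a hypothetical orders source table and a currency-conversion business rule; real projects would normally do this with an ETL tool rather than code like this.

```python
import sqlite3

source = sqlite3.connect(":memory:")   # stands in for a heterogeneous source system
target = sqlite3.connect(":memory:")   # stands in for the warehouse

source.execute("CREATE TABLE orders (order_id INTEGER, amount TEXT, currency TEXT)")
source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [(1, "100.50", "USD"), (2, "2300", "INR")])
target.execute("CREATE TABLE fact_orders (order_id INTEGER, amount_usd REAL)")

# Extract: pull the raw rows from the source system.
rows = source.execute("SELECT order_id, amount, currency FROM orders").fetchall()

# Transform: apply a (hypothetical) business rule -- normalise everything to USD.
usd_rate = {"USD": 1.0, "INR": 0.022}
clean = [(oid, round(float(amt) * usd_rate[cur], 2)) for oid, amt, cur in rows]

# Load: write the conformed rows into the warehouse fact table.
target.executemany("INSERT INTO fact_orders VALUES (?, ?)", clean)
target.commit()
print(target.execute("SELECT * FROM fact_orders").fetchall())
```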
According to W. H. Inmon, the data in the data warehouse should satisfy the following properties:
  • Subject-oriented.
  • Integrated.
  • Time-variant.
  • Non-volatile.

Components of a data warehouse
As the picture above depicts, a data warehouse is built from different components and processes that ultimately store all the data in the warehouse.
Source Systems
This is the first component of a data warehouse; all of its data is collected from the source systems. These can be categorized as OLTP systems, flat files, legacy systems, and external data.
Extract Transform Load (ETL) 
From a testing standpoint, data warehouse testing starts at this point. For any data warehouse project the ETL tool is the backbone of the warehouse; it is responsible for loading the source data into the target tables. Informatica PowerCenter, DataStage, and Ab Initio are a few commercial ETL tools, and there are also open source ETL tools such as Pentaho Kettle and Talend.
Operational Data Store (ODS)
An operational data store is a repository that contains aggregated data from the source systems. Its main purpose is to serve operational data for reporting that requires real-time or near real-time data.
Staging Area 
This is a temporary area where data from the source systems is copied and kept for some time so that transformations can be carried out without putting extra load on the source systems. Whether or not to use a staging area depends on the architecture of the data warehouse.
Data Marts
A data mart (DM) is the access layer of the data warehouse (DW) environment that is used to get data out to the users. The DM is a subset of the DW, usually oriented to a specific business line or team.
Analysis and Reporting Tools
These tools help end users access the data in the data warehouse for analysis.
OBIEE from Oracle, Cognos from IBM, and BusinessObjects (BO) from SAP are a few of the commercial tools available in the market today.

Testing the data warehouse:
The normal process that applies to generic system testing is valid for data warehouse testing as well. The steps are listed below; since they are the same as for any system, we will concentrate on the approach to testing the data warehouse.
  • Requirement Analysis 
  • Test planning 
  • Test requirement creation 
  • Test case creation 
  • Test execution 
  • Verification & Validation 
  • Review and walkthroughs
Approach to Test Data warehouse


  • Data Warehouse Testing – All the objects that are part of the data warehouse and that reside in different databases and schemas, such as tables, triggers, indexes, and views, are tested:
      - Validate all the DB objects
      - Validate trigger logic from the audit schema
      - Validate table-level constraints
  • Extraction Transformation and Load (ETL) Layer Testing – The scripts behind all the procedures in the ETL layer, which load the data from the source to the ODS (or staging) databases and from the ODS (or staging) area to the data warehouse DIM and FACT tables, are tested (a minimal checksum sketch follows this list):
      - Validation of extraction logic
      - Validation of transformation logic
      - Checksum comparison
      - Data completeness
      - Data integrity
      - Data validation
      - Initial/incremental load
      - Testing end-to-end data flow from Source – Target – Report
  • Reports Testing – Listed below are the tasks that are carried out as part of reports testing:
      - UI validation such as report interface, downloading, pagination, summation, grand totals
      - Drill-down reports, sorting, filters
      - Verifying the report's data
      - Comparison across different types of reports
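
As referenced in the ETL layer bullet above, here is a minimal sketch of a checksum and row-count comparison between a source query and its target query. The connection objects, table names, columns, and date filters are assumptions; any DB-API connections to the actual source and warehouse databases would take their place.

```python
def compare_source_target(src_conn, tgt_conn, src_sql, tgt_sql):
    """Compare row count and a simple checksum (SUM of a numeric column)
    between a source query and the corresponding target query."""
    src_count, src_sum = src_conn.execute(src_sql).fetchone()
    tgt_count, tgt_sum = tgt_conn.execute(tgt_sql).fetchone()
    assert src_count == tgt_count, f"row count mismatch: {src_count} vs {tgt_count}"
    assert src_sum == tgt_sum, f"checksum mismatch: {src_sum} vs {tgt_sum}"

# Hypothetical usage -- src_conn / tgt_conn are DB-API connections:
# compare_source_target(
#     src_conn, tgt_conn,
#     "SELECT COUNT(*), SUM(amount) FROM orders WHERE order_date = '2011-08-21'",
#     "SELECT COUNT(*), SUM(amount) FROM fact_orders WHERE order_date = '2011-08-21'")
```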

QA's test entry points in a typical data warehouse architecture.





Testing a data warehouse should start with understanding the business. This is as vital for a tester as it is for a developer. Test engineers should take part in business and technical walkthroughs of the requirements.

Testing should be performed for each and every component of the data warehouse, with ETL receiving the most attention since that is where data movement happens. A key data warehouse test strategy is to perform both analysis-based and query-based test execution. This applies to the front end of the data warehouse, such as reports, ad hoc reports, and charts: we test the reports against the data in the warehouse, and the same data is cross-verified by executing the related queries against the database. Any test strategy for a data warehouse should essentially concentrate on the points below.

1. Functional testing
In this test, for ETL processes, ensure that all the mappings deployed in the QA environment work as expected: verify the source, target, and connection parameters, the insert/update strategy, that error logs are generated, and that failure and report notifications sent via Informatica are proper. Ensure all mappings run without error. For reports, ensure that the data shown in the UI and the data in the database are the same, that there are no view display errors, and cover the other general UI test scenarios.
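
One way to automate the "report data equals database data" check mentioned above is to compare a downloaded report export with the equivalent warehouse query. The file name, column name, query, and connection below are hypothetical; the point is only the comparison pattern.

```python
import csv

def report_matches_warehouse(export_path, wh_conn, wh_sql, tolerance=0.01):
    """Compare the grand total of a downloaded report export (CSV)
    with the total returned by the equivalent warehouse query."""
    with open(export_path, newline="") as f:
        report_total = sum(float(row["amount"]) for row in csv.DictReader(f))
    (wh_total,) = wh_conn.execute(wh_sql).fetchone()
    return abs(report_total - float(wh_total)) <= tolerance

# Hypothetical usage:
# ok = report_matches_warehouse(
#     "daily_sales_report.csv", wh_conn,
#     "SELECT SUM(amount) FROM fact_sales WHERE sale_date = '2011-08-21'")
```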

2. Integration testing
Ensure the data warehouse application works as a whole. Integration test cases should contain scenarios for sequenced / scheduled ETL jobs. Test strategies should also be devised to verify how the data warehouse application reacts in case of failovers.

3. Regression testing
As stated above, requirements can't be frozen in any data warehouse application, so it is certain that there will be new requirements or enhancements to existing ones. In regression testing, ensure that all the existing code still works in new releases. Normally, for a data warehouse application this is verified by running all the functional test cases for the impacted area.

4. Performance testing
Scalability and robustness should be intrinsic to any data warehouse application. As requirements are not stagnant in these applications, scalability should not become the bottleneck: new requirements should be added successfully without any performance degradation in the existing ones. Hence performance testing is critical, and it is done using voluminous data that is in sync with the maximum load count of the ETL routines.
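
A small sketch of how performance-test input can be produced: generate a synthetic source extract whose size matches the assumed maximum daily load of the ETL routine. The file layout and the 5-million-row figure are purely illustrative assumptions.

```python
import csv
import random
from datetime import date, timedelta

def generate_load_file(path, row_count):
    """Write a synthetic source extract sized to the maximum expected
    daily volume, for use as performance-test input to the ETL routine."""
    start = date(2011, 1, 1)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["order_id", "order_date", "amount"])
        for i in range(row_count):
            order_date = start + timedelta(days=random.randint(0, 200))
            writer.writerow([i, order_date.isoformat(),
                             round(random.uniform(1, 5000), 2)])

# Hypothetical usage: assume a peak load of 5 million rows per day.
# generate_load_file("orders_perf_extract.csv", 5_000_000)
```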

5. User Acceptance testing
This is carried out by the business users. Since they are not aware of the ETL routines that need to run before verification, the technology team should properly document and explain to the business team the process of loading data into the data warehouse, and convey the same in terms of business rules.






Friday, August 19, 2011

Testing Data Quality

Data quality defined: Quality data is, simply put, data that meets business demands.


Why data quality?
According to Larry English, the "Father of Data Quality", the business costs of non-quality data, including irrecoverable costs, rework of products or services, workarounds, and lost and missed revenue, may be as high as 10 to 25 percent of the revenue or total budget of an organization.

Data Quality Impact:
Organizations with poor data quality spend time working with conflicting reports and flawed business plans, resulting in erroneous decisions made with outdated, inconsistent, and invalid data.

Approach to test data quality:
Data quality testing is an integral part of the data warehouse test plan, but since we have taken it up as a separate topic, let's see how data quality testing can be achieved. A data quality framework mainly deals with the dimensions of data listed below, and tests should be designed to validate these dimensions.

1. Row count: The record counts at the source and the target should be the same at a given point in time. Differences can occur due to missing records, extra records in the target, or duplicates in the target. This is usually tested by checking the row counts of the source and the target for a particular date range.
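
A sketch of the row-count check for a date range, assuming hypothetical orders and fact_orders tables, an order_date column, and qmark-style DB-API connections (such as sqlite3) to the source and target.

```python
def row_count_check(src_conn, tgt_conn, start_date, end_date):
    """Compare source and target row counts for a given load window."""
    src = src_conn.execute(
        "SELECT COUNT(*) FROM orders WHERE order_date BETWEEN ? AND ?",
        (start_date, end_date)).fetchone()[0]
    tgt = tgt_conn.execute(
        "SELECT COUNT(*) FROM fact_orders WHERE order_date BETWEEN ? AND ?",
        (start_date, end_date)).fetchone()[0]
    if src != tgt:
        print(f"Row count mismatch for {start_date}..{end_date}: "
              f"source={src}, target={tgt}")
    return src == tgt

# Hypothetical usage for one day's load:
# row_count_check(src_conn, tgt_conn, "2011-08-19", "2011-08-19")
```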

2. Completeness: All the data at any given point in time should be the same in the source and the target and should be in accordance with the business rules. This is usually done by comparing the data values in the source and the target. For daily health checks this dimension is usually ignored; completeness is thoroughly verified during functional and regression test execution.
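
A sketch of a completeness check that compares actual data values, not just counts: pull the business key plus compared columns from both sides and report what is missing or extra in the target. Table and column names are assumptions, and loading full key sets into memory is only workable for modest volumes.

```python
def completeness_check(src_conn, tgt_conn, src_sql, tgt_sql):
    """Compare data values: return (rows missing in target, rows extra in target)."""
    src_rows = set(src_conn.execute(src_sql).fetchall())
    tgt_rows = set(tgt_conn.execute(tgt_sql).fetchall())
    return src_rows - tgt_rows, tgt_rows - src_rows

# Hypothetical usage:
# missing, extra = completeness_check(
#     src_conn, tgt_conn,
#     "SELECT order_id, amount FROM orders",
#     "SELECT order_id, amount FROM fact_orders")
```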

3. Consistency: Data should be consistent for any changes that happen in transactions; this is to ensure that the view of the data from any mart is consistent. It is tested by querying the same data from different sources, and since changes in a DWH system are dynamic, these tests should be performed on every health check run.

4. Redundancy: Redundant data can appear in two forms, physical duplicates and logical duplicates. Physical duplicates are multiple entries of the same data in the same table, whereas a logical duplicate means there are two entries for one single entity. Checks for physical duplicates should be performed with every run of the DWH system health check, and tests for logical duplicates should be performed in the functional and regression rounds.
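
A sketch of the physical-duplicate check: group on the full business key and flag any combination that occurs more than once. The table and key columns passed in are hypothetical.

```python
def find_physical_duplicates(conn, table, key_cols):
    """Return business-key combinations that occur more than once in the table."""
    cols = ", ".join(key_cols)
    sql = (f"SELECT {cols}, COUNT(*) AS cnt FROM {table} "
           f"GROUP BY {cols} HAVING COUNT(*) > 1")
    return conn.execute(sql).fetchall()

# Hypothetical usage: flag customers loaded more than once into the dimension.
# dupes = find_physical_duplicates(wh_conn, "dim_customer", ["customer_id", "current_flag"])
```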

5. Data integrity: A change in data should be reflected in all the places where it appears. A test for this dimension is again ignored in the periodic / daily health check run.
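
A sketch of one common integrity check, finding fact rows whose dimension key has no matching dimension row; the fact table, dimension table, and key column names are assumptions.

```python
def find_orphan_fact_rows(conn, fact_table, fk_col, dim_table, dim_key):
    """Return fact rows whose dimension key has no matching row in the dimension table."""
    sql = (f"SELECT f.* FROM {fact_table} f "
           f"LEFT JOIN {dim_table} d ON f.{fk_col} = d.{dim_key} "
           f"WHERE d.{dim_key} IS NULL")
    return conn.execute(sql).fetchall()

# Hypothetical usage:
# orphans = find_orphan_fact_rows(wh_conn, "fact_orders", "customer_sk",
#                                 "dim_customer", "customer_sk")
```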

6. Accuracy: The degree to which data correctly reflects real-world objects. This is also ignored in the daily / periodic health check, as this dimension should be extensively tested in the functional and regression rounds.

7. Timeliness: Updated data should be available when it is required. This should be tested on every run of the health check.
The test cycle for data quality in the data warehouse should be performed as per the SLA. If the data warehouse is a near real-time data warehouse, it should be tested every day, or at least for the DIMs and FACTs that constitute the near real-time portion. Usually this is automated, but if not, manual execution should be carried out.
Note: The test approach depends on the implemented data quality framework.