Data Warehousing ETL tutorial
The ETL and Data Warehousing tutorial is organized into lessons representing various business intelligence scenarios, each of which describes a typical data warehousing challenge.
This guide can be considered an ETL process and Data Warehousing knowledge base, with a series of examples illustrating how to manage and implement the ETL process in a data warehouse environment.
The purpose of this tutorial is to outline and analyze the most widely encountered real-life data warehousing problems and challenges that need to be addressed during the design and architecture phases of a successful data warehouse project deployment.
Going through the sample implementations of the business scenarios is also a good way to compare Business Intelligence and ETL tools and get to know the different approaches to designing the data integration process. It also helps identify the strong and weak points of various ETL and data warehousing applications.
This tutorial shows how to use the following BI, ETL and data warehousing tools.
Data Warehousing & ETL Tutorial lessons
- Surrogate key generation - covers business keys and surrogate keys and shows how to design an ETL process that manages surrogate keys in a data warehouse environment. Sample design in Pentaho Data Integration.
- Header and trailer processing - considerations on processing files arranged in blocks consisting of a header record, body items and a trailer. Files of this type usually come from mainframes, and the pattern also applies to EDI and EPIC files. Solution examples in Datastage, SAS and Pentaho Data Integration.
- Loading customers - a data extract is placed on an FTP server, copied to an ETL server and loaded into the data warehouse. Sample load in Teradata MultiLoad.
- Data allocation - an ETL process case study on allocating data. Examples in Pentaho Data Integration and Cognos PowerPlay.
- Data masking - masking and scrambling algorithms and their ETL deployments. Sample Kettle implementation.
- Site traffic analysis - a guide to creating a data warehouse with data marts for website traffic analysis and reporting. Sample design in Pentaho Kettle.
- Data quality - an ETL process designed to test and cleanse data in a Data Warehouse. Sample outline in PDI.
- XML ETL processing.
ETL stands for Extract, Transform and Load. An ETL tool extracts data from different RDBMS source systems, transforms it by applying calculations such as aggregation and concatenation, and then loads it into the Data Warehouse system, where it is stored in the form of dimension and fact tables.
Extraction
A staging area is required during the ETL load, for several reasons:
- Source systems are only available for extraction during a specific time window, which may be shorter than the total data-load time. A staging area lets you extract the data before the time slot ends and hold it until the full load can run.
- A staging area is needed when you want to bring data from multiple sources together or join two or more systems. For example, you cannot run a single SQL query that joins two tables stored in two physically different databases (see the sketch after this list).
- Extraction time slots for different systems vary with time zones and operational hours.
- Data extracted from source systems can be reused by multiple data warehouse systems, operational data stores, and so on.
- ETL performs complex transformations and needs extra space to store intermediate data.
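Here is a minimal SQL sketch of the staging approach; the table and column names (stg_crm_customers, stg_billing_invoices) are illustrative assumptions, not taken from any specific lesson. Both extracts are first copied into the same staging database, after which a single join becomes possible:

-- Extracts from two physically separate source systems (e.g. CRM and billing),
-- copied into one staging schema by the extraction step.
CREATE TABLE stg_crm_customers (
    customer_id   INTEGER,
    customer_name VARCHAR(100)
);

CREATE TABLE stg_billing_invoices (
    invoice_id  INTEGER,
    customer_id INTEGER,
    amount      DECIMAL(12,2)
);

-- Once both extracts live in the same database, one query can join them.
SELECT c.customer_name, SUM(i.amount) AS total_billed
FROM stg_crm_customers c
JOIN stg_billing_invoices i ON i.customer_id = c.customer_id
GROUP BY c.customer_name;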
Transform
In the transformation step, you apply a set of functions to the extracted data before loading it into the target system. Data that does not require any transformation is known as direct-move or pass-through data.
You can apply different transformations to the data extracted from the source system. For example, you can perform customized calculations: if you need a sum of sales revenue that is not stored in the database, you can apply a SUM aggregation during transformation and load the result.
Similarly, if the first name and the last name are stored in different columns of a table, you can concatenate them before loading.
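A minimal sketch of both transformations; the staging tables stg_sales and stg_customers are assumed names used for illustration:

-- Derive a sum of sales revenue that is not stored in the source database.
SELECT product_key, SUM(sale_amount) AS total_revenue
FROM stg_sales
GROUP BY product_key;

-- Concatenate first and last name into a single column before loading.
-- (Standard SQL || operator; MySQL would use CONCAT(first_name, ' ', last_name).)
SELECT customer_id, first_name || ' ' || last_name AS full_name
FROM stg_customers;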
Load
During the load phase, data is loaded into the end target, which can be a flat file or a Data Warehouse system.
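A minimal load sketch under the same assumed naming, moving transformed staging rows into a warehouse fact table:

-- Load aggregated staging data into the fact table of the warehouse.
INSERT INTO fact_sales (date_key, product_key, sale_amount)
SELECT date_key, product_key, SUM(sale_amount)
FROM stg_sales
GROUP BY date_key, product_key;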
Online Analytical Processing
In an OLAP system, there are far fewer transactions than in a transactional system. The queries executed are complex in nature and involve data aggregations.
What is an Aggregation?
We save tables with pre-aggregated data at, for example, yearly (1 row), quarterly (4 rows) or monthly (12 rows) granularity. If someone has to do a year-to-year comparison, only one row per year is processed; against an un-aggregated table, the same comparison has to scan all the detail rows. For example, the following query must aggregate over every matching detail row at query time:
SELECT SUM(salary) FROM employee WHERE title = 'Programmer';
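A minimal sketch of the aggregate-table approach described above; sales_detail and sales_yearly are assumed, illustrative names:

-- Materialize the yearly aggregate once, during the ETL load.
CREATE TABLE sales_yearly AS
SELECT sale_year, SUM(sale_amount) AS total_sales
FROM sales_detail
GROUP BY sale_year;

-- A year-to-year comparison now reads one row per year
-- instead of scanning every detail row.
SELECT sale_year, total_sales
FROM sales_yearly
WHERE sale_year IN (2022, 2023);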
Effective Measures in an OLAP system
Response time is one of the key effectiveness measures in an OLAP system. Aggregated data is maintained in multi-dimensional schemas such as star schemas (when data is arranged into hierarchical groups, often called dimensions, and into facts and aggregate facts, the arrangement is called a schema).
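As a minimal star-schema sketch (the dim_date, dim_product and fact_sales names are illustrative assumptions), a central fact table references the surrounding dimension tables:

CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY,
    year     INTEGER,
    quarter  INTEGER,
    month    INTEGER
);

CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name VARCHAR(100)
);

-- The fact table sits at the center of the star and holds the measures.
CREATE TABLE fact_sales (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    sale_amount DECIMAL(12,2)
);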
The latency of an OLAP system is a few hours, compared to data marts, where the latency is expected to be closer to a day.
Online Transaction Processing
In an OLTP system, there are a large number of short online transactions such as INSERT, UPDATE, and DELETE.
In an OLTP system, an effective measure is the processing time of short transactions, which is very low; the number of transactions per second measures the system's effectiveness. An OLTP system controls data integrity in multi-access environments. It contains current, detailed data maintained in entity-model (3NF) schemas.
Example
A day-to-day transaction system in a retail store, where customer records are inserted, updated and deleted on a daily basis. It provides very fast query processing. OLTP databases contain detailed and current data, and the schema used to store an OLTP database is the entity model.
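A minimal sketch of such a short OLTP transaction (table and column names are assumptions; BEGIN/COMMIT syntax varies slightly by database):

-- Update a customer record and record the change atomically.
BEGIN;
UPDATE customers
SET phone = '555-0101'
WHERE customer_id = 42;
INSERT INTO customer_audit (customer_id, changed_at)
VALUES (42, CURRENT_TIMESTAMP);
COMMIT;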
Differences between OLTP and OLAP
The following points show the key differences between an OLTP and an OLAP system.
- Indexes − An OLTP system has only a few indexes, while an OLAP system has many indexes for performance optimization.
- Joins − In an OLTP system there are a large number of joins and the data is normalized; in an OLAP system there are fewer joins and the data is de-normalized.
- Aggregation − In an OLTP system data is not aggregated, while an OLAP database makes heavy use of aggregations.
Predictive Analysis
Predictive analysis is the process of finding hidden patterns in the data stored in a DW system, using different mathematical functions to predict future outcomes.
A predictive analysis system differs from an OLAP system in its use: it focuses on future outcomes, whereas an OLAP system focuses on current and historical data processing for analytical reporting.
Data models define how the logical structure of a database is modeled. They are fundamental entities for introducing abstraction in a DBMS, defining how pieces of data are connected to each other and how they are processed and stored inside the system.
The very first data models were flat data models, where all the data was kept in the same plane. These early data models were not very scientific, so they were prone to duplication and update anomalies.
Entity-Relationship Model
The Entity-Relationship (ER) Model is based on the notion of real-world entities and the relationships among them. While formulating a real-world scenario into a database model, the ER Model creates entity sets, relationship sets, general attributes and constraints.
ER Model is best used for the conceptual design of a database.
ER Model is based on −
- Entities and their attributes.
- Relationships among entities.
These concepts are explained below.
- Entity − An entity in an ER Model is a real-world object having properties called attributes. Every attribute is defined by its set of permitted values, called its domain. For example, in a school database, a student is considered an entity, with attributes like name, age and class.
- Relationship − The logical association among entities is called a relationship. Relationships are mapped between entities in various ways. Mapping cardinalities define the number of associations between two entities (see the sketch after the following list). Mapping cardinalities −
- one to one
- one to many
- many to one
- many to many
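A minimal sketch of how these concepts map onto tables, reusing the school-database example from above (the column names are illustrative); the foreign key expresses a one-to-many relationship between a class and its students:

CREATE TABLE class (
    class_id   INTEGER PRIMARY KEY,
    class_name VARCHAR(50)
);

-- Each row is one student entity; each column is an attribute
-- whose type restricts the attribute to its domain.
CREATE TABLE student (
    student_id INTEGER PRIMARY KEY,
    name       VARCHAR(100),
    age        INTEGER,
    class_id   INTEGER REFERENCES class(class_id)  -- one class, many students
);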
Relational Model
The most popular data model in DBMS is the Relational Model. It is a more scientific model than the others. This model is based on first-order predicate logic and defines a table as an n-ary relation.
The main highlights of this model are as follows (a SQL sketch follows the list) −
- Data is stored in tables called relations.
- Relations can be normalized.
- In normalized relations, values saved are atomic values.
- Each row in a relation is unique.
- Each column in a relation contains values from the same domain.
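A minimal SQL sketch of these highlights, using an employee relation consistent with the aggregation query shown earlier (the exact column set is an assumption):

CREATE TABLE employee (
    employee_id INTEGER PRIMARY KEY,   -- makes each row unique
    name        VARCHAR(100) NOT NULL, -- atomic values only
    title       VARCHAR(50),           -- every value drawn from one domain
    salary      DECIMAL(10,2)
);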