DWH/ETL Concepts: DWH Questions

What is a data warehouse, and why is it used?

A data warehouse is a repository of data. The pieces of information stored are relevant to each other and support the decision making tree of some corporation or entity. It can incorporate multiple data sources to store all the data connected to the subject. Tipically it is composed by archive or historical data that can be analised. A data warehouse is supported on a database system.

What are the basic stages of a data warehouse?

The first stage to build a data warehouse is the initial data introduction, tipically this can be achived by copying some operational database. This is called and offline operational database. Then, we have to feed new sets of data to the newest created data warehouse. Therefore, this database is updated with large sets of data in a regular time basis (week, month). With this step, we’ve successfully built a offline data warehouse.

To achived a Real-time data warehouse you have to insert the operational data in real time. When this is integrated with the application, reporting on the data, it’s called a Integrated data warehouse.

What is OLAP and OLTP, and which are their main differences?

OLAP performs the analysis on the data, reporting the information. The focus on these kind of systems is the reading of data, thus using the SELECT database statement. OLTP manages the transaction system that collects the data. Actions like INSERT, UPDATE or DELETE are the focus here.

This topic is covered in a much more detailed way in this OLTP and OLAP article, provided by us.

What is a fact table?

The fact table is a concrete measure that is tipically stored as numeric values, they have the core business information.

In detail, the fact table contains two different kinds of information. The foreign keys to the related dimension tables, providing joining relationships, and the measure columns which represent the added data.

And a dimension table?

Dimension tables describe the quantified data on the fact tables, giving context on its fields. They contain descriptive attributes which provide more information related to the fact table.

Fact tables have foreign keys to the dimension ones and the relation is one to many.

Describe the star schema.

In the star schema we have a centralized fact table and multiple dimensions linked to it. These dimensions are only related to the fact table, so the only link they have is to that specific table.

The fact table relates to the dimensions having their primary keys as foreign keys, and other extra attributes relevant to the data warehouse.

Therefore this kind of schema is denormalized and better for simple querys, which are usually faster.

The next diagram represents a simple star schema implementation.

Describe the snowflake schema.

The Snoflake schema have links and relationships between dimensions, becoming a normalized organization of tables: fact and dimensions. This type of schema is usually more complex because each dimension can be composed of many other dimensions.

This kind of organization is explain in the next schema.

What is a OLAP cube?

A OLAP data cube is a representation of data in multiple dimensions, using facts and dimensions. It is characterized by the combination of information according to it’s relationship.

It can consist in a colection of 0 to many dimensions, representing specific data. There are five basic operation to perform on these kind of data cubes:

Slicing
Dicing
Roll-Up
Drill-Up and Drill-Down
Pivoting

Explain the slicing operation.

The slicing operation on a OLAP Cube establishes a single value for one of the dimensions of the cube, selecting all the data that corresponds to the selected value.

So, by executing a slice on the cube we get all the selected dimension and fact information for the specific value assigned.

Explain the dicing operation

Dicing on OLAP Cubes consists on choosing an interval of values for some of the dimensions representing in the cube, and selecting the data that corresponds to those intervals.

This operation creates a subset of the cube which contains the data between the intervals.

Explain the roll up operation.

The roll-up operation performs some computing rules on the data of a OLAP cube specific dimension, returning the computed information to the end user.

These applied rules can be defined and summarize the information on that specific dimension.

Explain the drill-up/drill-down operation

These operations allow the exploration of information between the levels of data presented on dimensions and facts on the data warehouse.

It can select summarized information or the details that compose that data aggregation.

Explain the pivoting operation

Pivoting allows the rotation of the cube on its dimensions providing the user a different point of view of the explored data.

The cube can be rotated on every face.

Explain the concept of data mart.

Data mart is a specific group of data linked to a subject, which is part of a specific data warehouse. Therefore, a data warehouse have multiple data marts.

Basically a data mart is a small data warehouse with condensed information about a specific subject and it’s relationships. Usually each data mart is related to a department, business unit or something that can function individually within a data warehouse.

Which are the reasons to create a Data Mart?

There are various reasons that lead to a the creation of a data mart. The most important ones are:

Create a data specific environment, providing easy access to it
Easy to create
Data is more relevant to users having only the essential information
Lower cost than creating a whole data warehouse

What does Normalization mean?

Normalization is the process in which tables and fields are organized in a database in order to reduce the redundancy of stored data. Therefore many relationships between tables are defined, providing a better organized database system.

The key benefits of normalization are:

Low database data redundancy
Searching and indexing is faster
Fewer null values since data is well distributed
Cleaner and easier to mantain

What is a ETL process?

A ETL process consists on getting data from different sources and converting it to enter in a specific data warehouse.

Theses processes transform and normalize the data, providing a common base for all sources to integrate with a data warehouse.

What is aggregation?

Aggregation is the representation of a set of data, joined by some aggregation function.

This functions may be simple or complex depending of the purpose of the selected aggregation data. A simple function is the sum of every value.

Explain what is partitioning.

Partitioning is the process of dividing all data warehouse elements into smaller and distinct sets of data, keeping the relationships between the elements.

The benefits of partitioning are:

Easy management
Better performance
Availability
Easier backup and recovery

What types of dimensions do you know?

There are four common kinds of dimensions in a data warehouse:

Conformed Dimension
Degenerated Dimension
Role-Playing Dimension
Junk Dimension

Describe a conformed dimension.

A conformed dimension is shared between various subjects in the data warehouse. Therefore it is widely used in different contexts, meaning the same thing in each one of them.

Explain what is a degenerated dimension.

The degenerated dimension is derived from a fact table and doesn’t have it’s own dimension table.

What is a role-playing dimension?

A role-playing dimension has multiple applications within the same Data Warehouse and it is reused for different purposes. One example is an ID. In a data warehouse we can have several kinds of ID’s: client id, product id, etc.

What are junk dimensions?

Junk dimensions are composed by some attributes that don’t fit in another tables and are usually used with rapidly changing dimensions.

What is the difference between metadata and data dictionary?

A data dictionary has all the definitions of a database, the tables and fields, rows, number of rows, and that kind of information.

Metadata describes some kind of information with addicional and important data which is complementary.

1. What is Datawarehousing?

A Datawarehouse is the repository of a data and it is used for Management decision support system. Datawarehouse consists of wide variety of data that has high level of business conditions at a single point in time.

In single sentence, it is repository of integrated information which can be available for queries and analysis.

2. What is Business Intelligence?

Business Intelligence is also known as DSS – Decision support system which refers to the technologies, application and practices for the collection, integration and analysis of the business related information or data. Even, it helps to see the data on the information itself.

3. What is Dimension Table?

Dimension table is a table which contain attributes of measurements stored in fact tables. This table consists of hierarchies, categories and logic that can be used to traverse in nodes.

4. What is Fact Table?

Fact table contains the measurement of business processes, and it contains foreign keys for the dimension tables.

Example – If the business process is manufacturing of bricks

Average number of bricks produced by one person/machine – measure of the business process

5. What are the stages of Datawarehousing?

There are four stages of Datawarehousing:

Datawarehouse

Offline Operational Database
Offline Data Warehouse
Real Time Datawarehouse
Integrated Datawarehouse

6. What is Data Mining?

Data Mining is set to be a process of analyzing the data in different dimensions or perspectives and summarizing into a useful information. Can be queried and retrieved the data from database in their own format.

7. What is OLTP?

OLTP is abbreviated as On-Line Transaction Processing, and it is an application that modifies the data whenever it received and has large number of simultaneous users.

8. What is OLAP?

OLAP is abbreviated as Online Analytical Processing, and it is set to be a system which collects, manages, processes multi-dimensional data for analysis and management purposes.

9. What is the difference between OLTP and OLAP?

Following are the differences between OLTP and OLAP:

OLTP	OLAP
Data is from original data source	Data is from various data sources
Simple queries by users	Complex queries by system
Normalized small database	De-normalized Large Database
Fundamental business tasks	Multi-dimensional business tasks

10. What is ODS?

ODS is abbreviated as Operational Data Store and it is a repository of real time operational data rather than long term trend data.

11. What is the difference between View and Materialized View?

A view is nothing but a virtual table which takes the output of the query and it can be used in place of tables.

A materialized view is nothing but an indirect access to the table data by storing the results of a query in a separate schema.

12. What is ETL?

ETL is abbreviated as Extract, Transform and Load. ETL is a software which is used to reads the data from the specified data source and extracts a desired subset of data. Next, it transform the data using rules and lookup tables and convert it to a desired state.

Then, load function is used to load the resulting data to the target database.

13. What is VLDB?

VLDB is abbreviated as Very Large Database and its size is set to be more than one terabyte database. These are decision support systems which is used to server large number of users.

14. What is real-time datawarehousing?

Real-time datawarehousing captures the business data whenever it occurs. When there is business activity gets completed, that data will be available in the flow and become available for use instantly.

15. What are Aggregate tables?

Aggregate tables are the tables which contain the existing warehouse data which has been grouped to certain level of dimensions. It is easy to retrieve data from the aggregated tables than the original table which has more number of records.

This table reduces the load in the database server and increases the performance of the query.

16. What is factless fact tables?

A factless fact tables are the fact table which doesn’t contain numeric fact column in the fact table.

17. How can we load the time dimension?

Time dimensions are usually loaded through all possible dates in a year and it can be done through a program. Here, 100 years can be represented with one row per day.

18. What are Non-additive facts?

Non-Addictive facts are said to be facts that cannot be summed up for any of the dimensions present in the fact table. If there are changes in the dimensions, same facts can be useful.

19. What is conformed fact?

Conformed fact is a table which can be used across multiple data marts in combined with the multiple fact tables.

20. What is Datamart?

A Datamart is a specialized version of Datawarehousing and it contains a snapshot of operational data that helps the business people to decide with the analysis of past trends and experiences. A data mart helps to emphasizes on easy access to relevant information.

21. What is Active Datawarehousing?

An active datawarehouse is a datawarehouse that enables decision makers within a company or organization to manage customer relationships effectively and efficiently.

22. What is the difference between Datawarehouse and OLAP?

Datawarehouse is a place where the whole data is stored for analyzing, but OLAP is used for analyzing the data, managing aggregations, information partitioning into minor level information.

23. What is ER Diagram?

ER diagram is abbreviated as Entity-Relationship diagram which illustrates the interrelationships between the entities in the database. This diagram shows the structure of each tables and the links between the tables.

24. What are the key columns in Fact and dimension tables?

Foreign keys of dimension tables are primary keys of entity tables. Foreign keys of fact tables are the primary keys of the dimension tables.

25. What is SCD?

SCD is defined as slowly changing dimensions, and it applies to the cases where record changes over time.

26. What are the types of SCD?

There are three types of SCD and they are as follows:

SCD 1 – The new record replaces the original record

SCD 2 – A new record is added to the existing customer dimension table

SCD 3 – A original data is modified to include new data

27. What is BUS Schema?

BUS schema consists of suite of confirmed dimension and standardized definition if there is a fact tables.

28. What is Star Schema?

Star schema is nothing but a type of organizing the tables in such a way that result can be retrieved from the database quickly in the data warehouse environment.

29. What is Snowflake Schema?

Snowflake schema which has primary dimension table to which one or more dimensions can be joined. The primary dimension table is the only table that can be joined with the fact table.

30. What is a core dimension?

Core dimension is nothing but a Dimension table which is used as dedicated

for single fact table or datamart.

31. What is called data cleaning?

Name itself implies that it is a self explanatory term. Cleaning of Orphan records,

Data breaching business rules, Inconsistent data and missing information in a database.

32. What is Metadata?

Metadata is defined as data about the data. The metadata contains information like

number of columns used, fix width and limited width, ordering of fields and data

types of the fields.

33. What are loops in Datawarehousing?

In datawarehousing, loops are existing between the tables. If there is a loop between

the tables, then the query generation will take more time and it creates ambiguity. It is advised to

avoid loop between the tables.

34. Whether Dimension table can have numeric value?

Yes, dimension table can have numeric value as they are the descriptive elements

of our business.

35. What is the definition of Cube in Datawarehousing?

Cubes are logical representation of multidimensional data. The edge of the cube has

the dimension members,and the body of the cube contains the data values.

36. What is called Dimensional Modelling?

Dimensional Modeling is a concept which can be used by dataware house designers

to build their own datawarehouse. This model can be stored in two types of tables –

Facts and Dimension table.

Fact table has facts and measurements of the business and dimension table contains

the context of measurements.

37. What are the types of Dimensional Modeling?

There are three types of Dimensional Modeling and they are as follows:

Conceptual Modeling
Logical Modeling
Physical Modeling

38. What is surrogate key?

Surrogate key is nothing but a substitute for the natural primary key. It is set to be a

unique identifier for each row that can be used for the primary key to a table.

39. What is the difference between ER Modeling and Dimensional Modeling?

ER modeling will have logical and physical model but Dimensional modeling will have

only Physical model.

ER Modeling is used for normalizing the OLTP database design whereas Dimensional

Modeling is used for de-normalizing the ROLAP and MOLAP design.

40. What are the steps to build the datawarehouse?

Following are the steps to be followed to build the datawaerhouse:

Gathering business requirements
Identifying the necessary sources
Identifying the facts
Defining the dimensions
Defining the attributes
Redefine the dimensions and attributes if required
Organize the Attribute hierarchy
Define Relationships
Assign unique Identifiers

41. What are the different types of datawarehosuing?

Following are the different types of Datawarehousing:

Enterprise Datawarehousing
Operational Data Store
Data Mart

42. What needs to be done while starting the database?

Following need to be done to start the database:

Start an Instance
Mount the database
Open the database

43. What needs to be done when the database is shutdown?

Following needs to be done when the database is shutdown:

Close the database
Dismount the database
Shutdown the Instance

44. Can we take backup when the database is opened?

No, We cannot take full backup when the database is opened.

45. What is defined as Partial Backup?

A Partial backup in an operating system is a backup short of full backup and it can be done

while the database is opened or shutdown.

46. What is the goal of Optimizer?

The goal to Optimizer is to find the most efficient way to execute the SQL statements.

47. What is Execution Plan?

Execution Plan is a plan which is used to the optimizer to select the combination of the steps.

48. What are the approaches used by Optimizer during execution plan?

There are two approaches:

Rule Based
Cost Based

49. What are the tools available for ETL?

Following are the ETL tools available:

Informatica
Data Stage
Oracle
Warehouse Builder
Ab Initio
Data Junction

50.What is the difference between metadata and data dictionary?
Metadata is defined as data about the data. But, Data dictionary contain the information

about the project information, graphs, abinito commands and server information.

DWH/ETL Concepts

Pages

DWH Questions

No comments:

Post a Comment