A Learner's Blog.....: Basic DWH Interview Questions.

What is the difference between OLAP and a data warehouse?

A data warehouse is the place where data is stored for analysis.

OLAP, by contrast, is the process of analyzing that data: managing aggregations and partitioning information into cubes for in-depth analysis and visualization.

ODS (operational data store): A collection of tables created in the data warehouse that maintains only current data.

OLTP (online transaction processing): Maintains data only for transactions. OLTP systems are designed for recording the daily operations and transactions of a business.

What are non-additive facts? Explain in detail.

A fact may be a measure, a metric or a dollar value. Facts differ in whether they can be summed: some measures and metrics are non-additive.

A dollar value is an additive fact. If we want to find the amount for a particular place over a particular period of time, we can add the dollar amounts and come up with the total.

A non-additive fact is one that cannot be meaningfully summed across any of the dimensions (dimension keys) of the fact table. For example, take the heights measured for 'citizens by geographical location': when we roll up 'city' data to 'state' level, we should not add the heights of the citizens; we may instead want to use them to derive a count.

Examples: ratio columns, profit margin.

What is a cube?

A cube is used in a DWH to represent multidimensional data logically.

Why is denormalization promoted in Universe design?

In a relational data model, lookup tables are kept separate for normalization purposes rather than merged into a single table. In dimensional data modelling (star schema), these tables are merged into a single table, called a DIMENSION table, for better performance and easier slicing of data. Merging the tables into one large dimension table removes the complex intermediate joins, so dimension tables are joined directly to fact tables. Although this introduces redundant data in the DIMENSION table, a dimension table is small compared with the FACT table (around 15% of its size, as a rough rule of thumb). That is why denormalization is promoted in Universe design.

Hyperion is one of the tools used in data warehousing; it is an OLAP tool. Why can't you display that tool?

Explain yourself

What is a factless fact table? Where have you used it in your project?

A factless fact table contains only keys (the foreign keys to the dimensions); no measures are available in it.

Because it does not contain any facts (measures), it only records that a combination of dimension values occurred. It is used, for example, when we are integrating fact tables.

What are an aggregate table and an aggregate fact table? Give examples of both.

An aggregate table contains summarized data. Materialized views are a common way to implement aggregate tables.

For example, in sales we may store transactions only at the date level. If we want a report such as sales by product per year, we aggregate the date-level values into tables such as week_agg, month_agg, quarter_agg and year_agg. To retrieve data from these tables we use an aggregate-aware function (in a BusinessObjects universe this is presumably @Aggregate_Aware).

An aggregate fact table is a fact table built in this way: it holds the measures of the transaction-level fact table summarized to a coarser grain such as week, month, quarter or year.

Please explain SCD Type 1, Type 2 and Type 3 in detail.

Type 1: most recent value only

Type 2: full history, tracked with one of

i) Version number

ii) Flag

iii) Date

Type 3: current value and one previous value

SCD stands for slowly changing dimension.

There are three common types:

SCD1, SCD2, SCD3

SCD1: Suppose the data in the table gets updated. There are two methods. One is to drop the table and load the new one, but that is a long process.

Alternatively, by using table comparison we can update only the particular rows that have been modified, overwriting the old values. That is SCD1.

SCD2: By using key generation we generate a new row-number (surrogate key) column. If there is an update, the updated record is inserted as the next row and the row number is incremented automatically by 1, so the earlier versions are kept.

SCD3: Here an extra column is added and the updated information is stored in that column. If the table is updated again, another column is added, and so on.

Of these, SCD2 is the most widely used.

Where is a data warehouse management system used, and in which applications?

A data warehousing system is used for OLAP, that is, in systems where the main need is analysis of the data. Senior managers and top executives use it for analysis so that they can make well-informed decisions that boost the productivity of the organization.

What is a snapshot?

You can disconnect the report from the catalog to which it is attached by saving the report with a snapshot of the data. However, you must reconnect to the catalog if you want to refresh the data.

What is active data warehousing?

An active data warehouse provides information that enables decision-makers within an organization to manage customer relationships nimbly, efficiently and proactively. Active data warehousing is all about integrating advanced decision support with day-to-day, even minute-to-minute, decision making in a way that increases the quality of those customer touches, which encourages customer loyalty and thus secures an organization's bottom line. The marketplace is coming of age as we progress from first-generation "passive" decision-support systems to current- and next-generation "active" data warehouse implementations.

What is the difference between a data warehouse and BI?

Simply speaking, BI is the capability of analyzing the data in a data warehouse to the advantage of the business. A BI tool analyzes the data of a data warehouse so that business decisions can be made based on the results of the analysis.

Business Intelligence is a broad category of application programs and techniques used for querying, retrieving, reporting on and analyzing business information multidimensionally.

Put another way, Business Intelligence is a collection of application specifications that allow client applications to retrieve business information from the data warehouse in order to make business decisions.

Are OLAP databases called decision support systems? True or false?

True.

OLAP (online analytical processing) works by analysing aggregated data to produce reports on which top management can base business actions and decisions, which is the role of a DSS (decision support system).

What is the difference between a data warehouse and BI?

Warehouse management matters most when large volumes of data have to be handled. For example, MS Access can store around a hundred thousand records consistently and retrieve them, and every database has its own strengths in the same way. But when the data comes from various databases and runs to more than 200 columns, retrieving it through SQL and PL/SQL is not easy and takes many lines of code. In such cases a DWH is very important for maintaining the data, especially in banking, insurance, telecom and similar businesses.

What is the difference between data warehousing and business intelligence?

Data warehousing deals with all aspects of managing the development, implementation and operation of a data warehouse or data mart, including metadata management, data acquisition, data cleansing, data transformation, storage management, data distribution, data archiving, operational reporting, analytical reporting, security management, backup/recovery planning, etc. Business intelligence, on the other hand, is a set of software tools that enable an organization to analyze measurable aspects of its business such as sales performance, profitability, operational efficiency, effectiveness of marketing campaigns, market penetration among certain customer groups, cost trends, anomalies and exceptions, etc. Typically, the term "business intelligence" is used to encompass OLAP, data visualization, data mining and query/reporting tools. Think of the data warehouse as the back office and business intelligence as the entire business including the back office. The business needs the back office on which to function, but the back office without a business to support makes no sense.

As explained above, the data warehouse (or data mart) holds the data that is the final product of all these processes: metadata management, acquisition, data cleansing, transformation, loading and the others mentioned. Business intelligence is a set of software that connects to the data mart to produce the various reports that are useful for the business and for the smooth running of the company. Business decisions are then taken based on the data warehouse reporting. Hope this makes sense.

What is the difference between ODS and OLTP?

ODS: A collection of tables created in the data warehouse that maintains only current data.

OLTP, by contrast, maintains data only for transactions; OLTP systems are designed for recording the daily operations and transactions of a business.

What is a data warehouse?

A data warehouse is an electronic store of an organization's historical data, kept for the purpose of analysis and reporting. According to Bill Inmon, a data warehouse should be subject-oriented, non-volatile, integrated and time-variant.

What are the benefits of a data warehouse?

A data warehouse helps to integrate data and store it historically so that we can analyze different aspects of the business, including performance, trends and predictions, over a given time frame and use the results of our analysis to improve the efficiency of business processes.

Why is a data warehouse used?

For a long time, and still today, data warehouses have been built to facilitate reporting on the key business processes of an organization, typically through key performance indicators (KPIs). Data warehouses also help to integrate data from different sources and present a single point of truth for the business measures.

A data warehouse can further be used for data mining, which helps with trend prediction, forecasting, pattern recognition etc.

What is the difference between OLTP and OLAP?

OLTP is the transaction system that collects business data, whereas OLAP is the reporting and analysis system on that data.

OLTP systems are optimized for INSERT and UPDATE operations and are therefore highly normalized. OLAP systems, on the other hand, are deliberately denormalized for fast data retrieval through SELECT operations.

What is a data mart?

Data marts are generally designed for a single subject area. An organization may have data pertaining to different departments such as Finance, HR and Marketing stored in the data warehouse, and each department may have its own data mart. These data marts can be built on top of the data warehouse.

What is an ER model?

The ER model, or entity-relationship model, is a data modeling methodology in which the goal is to normalize the data by reducing redundancy. This is different from dimensional modeling, where the main goal is to make data retrieval easier and faster.

What is dimensional modeling?

A dimensional model consists of dimension and fact tables. Fact tables store the transactional measurements together with the foreign keys from the dimension tables that qualify the data. The goal of a dimensional model is not to achieve a high degree of normalization but to facilitate easy and fast data retrieval.

Ralph Kimball is one of the strongest proponents of this very popular data modeling technique which is often used in many enterprise level data warehouses.

What is a dimension?

A dimension is something that qualifies a quantity (measure).

For example, consider this: if I just say "20kg", it does not mean anything. But if I say, "20kg of Rice (Product) is sold to Ramesh (customer) on 5th April (date)", then that makes sense. Product, customer and date are the dimensions that qualify the measure, 20kg.

Dimensions are mutually independent. Technically speaking, a dimension is a data element that categorizes each item in a data set into non-overlapping regions.

What is a fact?

A fact is something that is quantifiable (or measurable). Facts are typically (but not always) numerical values that can be aggregated.

What are additive, semi-additive and non-additive measures?

Non-additive Measures

Non-additive measures are those which cannot be used inside any numeric aggregation function (e.g. SUM(), AVG() etc.). One example of a non-additive fact is any kind of ratio or percentage, for example a 5% profit margin or a revenue-to-asset ratio. Non-numerical data can also be a non-additive measure when it is stored in a fact table, e.g. some kind of varchar flag.

Semi Additive Measures

Semi-additive measures are those to which only a subset of aggregation functions can be applied. Take an account balance: summing the balance over time does not give a useful result, but the max() or min() balance might be useful. Or consider a price rate or currency rate: a sum is meaningless on a rate, but an average might be useful.

Additive Measures

Additive measures can be used with any aggregation function, such as Sum() or Avg(). An example is sales quantity.
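
The three kinds of measure can be illustrated with a few lines of Python. This is only a sketch: the rows, regions and amounts below are made-up values, not data from any real warehouse.

```python
# Hypothetical fact rows: (region, revenue, profit). The margin is derived.
sales = [
    ("North", 1000.0, 100.0),  # 10% margin
    ("South", 200.0, 100.0),   # 50% margin
]

# Additive measures: summing across rows is meaningful.
total_revenue = sum(revenue for _, revenue, _ in sales)
total_profit = sum(profit for _, _, profit in sales)

# Non-additive measure: aggregating the per-row ratios gives a misleading figure.
row_margins = [profit / revenue for _, revenue, profit in sales]
naive_margin = sum(row_margins) / len(row_margins)

# Recompute the ratio from the summed additive components instead.
overall_margin = total_profit / total_revenue

# Semi-additive measure: month-end balances of one hypothetical account.
balances = [("Jan", 500.0), ("Feb", 700.0), ("Mar", 650.0)]
closing_balance = balances[-1][1]                 # last value, not a sum
highest_balance = max(amount for _, amount in balances)

print(f"naive margin   = {naive_margin:.2%}")
print(f"overall margin = {overall_margin:.2%}")
print(f"closing balance = {closing_balance}, highest = {highest_balance}")
```

The design point is that a ratio is usually stored as its additive components (here revenue and profit) and recomputed after aggregation.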

What is a star schema?

This schema is used in data warehouse models where one centralized fact table references a number of dimension tables, so that the primary keys from all the dimension tables flow into the fact table (as foreign keys), where the measures are stored. The entity-relationship diagram looks like a star, hence the name.

https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjxHGs4Bkq2-xVuz5C10jEfQkALSsVfLugjXaJl8GVUduOb52_bcJqC7ppkH3b7K8FJ7O1dgfcCEsTtcthDNCGVP9mC9daeJC1y5XnwU0U1JXDHRYHo2ljmyt_k1rqZCXQqll5UDOgPUANJ/s640/S.png
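
A star schema can be tried out with Python's built-in sqlite3 module. The sketch below is an assumption-laden toy: the table names (dim_product, dim_customer, fact_sales), the columns and the rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Two dimension tables, each with its own primary key.
    CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, product_name  TEXT);
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT);

    -- The central fact table holds the foreign keys plus the measure.
    CREATE TABLE fact_sales (
        product_key  INTEGER REFERENCES dim_product(product_key),
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        quantity_kg  REAL
    );

    INSERT INTO dim_product  VALUES (1, 'Rice'), (2, 'Wheat');
    INSERT INTO dim_customer VALUES (1, 'Ramesh'), (2, 'Sita');
    INSERT INTO fact_sales   VALUES (1, 1, 20), (1, 2, 5), (2, 1, 10);
""")

# A typical star-schema query: join the fact table to a dimension and aggregate.
rows = conn.execute("""
    SELECT p.product_name, SUM(f.quantity_kg)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    GROUP BY p.product_name
    ORDER BY p.product_name
""").fetchall()

print(rows)
```

Every dimension is one join away from the fact table, which is what keeps such queries simple.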

What is a snowflake schema?

This is another logical arrangement of tables in dimensional modeling, where a centralized fact table references a number of dimension tables; however, those dimension tables are further normalized into multiple related tables.

Consider a fact table that stores the sales quantity for each product and customer at a certain time. Sales quantity is the measure here, and the keys from the customer, product and time dimension tables flow into the fact table. Additionally, all the products can be grouped under different product families stored in a separate table, so that the primary key of the product family table goes into the product table as a foreign key. Such a construct is called a snowflake schema, as the product table is further snowflaked into product family.

https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhmsfOcT_E3mW7aFyaRxVg-Z5lfGl0cgz6VB64VPIhaYiE-qV8hEXbCytq4s4bjTQPZpX9tHlm7sWMOOuZaoMFrQV0hF2Q5bWIaMoEmz-qjBtCCUGGQuUSYqfcIVKeYGQhODc791-CfJcfQ/s640/S1.png

Note
Snow-flake increases degree of normalization in the design.

What are the different types of dimension?
In a data warehouse model, dimensions can be of the following types:

1. Conformed Dimension

2. Junk Dimension

3. Degenerated Dimension

4. Role Playing Dimension

Based on how frequently the data inside a dimension changes, we can further classify dimension as

1. Unchanging or static dimension (UCD)

2. Slowly changing dimension (SCD)

3. Rapidly changing Dimension (RCD)

What is a 'Conformed Dimension'?

A conformed dimension is a dimension that is shared across multiple subject areas. Consider the 'Customer' dimension: both the marketing and the sales departments may use the same customer dimension table in their reports. Similarly, a 'Time' or 'Date' dimension will be shared by different subject areas. These dimensions are conformed dimensions.

Theoretically, two dimensions which are either identical or strict mathematical subsets of one another are said to be conformed.

What is degenerated dimension?

A degenerated dimension is a dimension that is derived from the fact table and does not have its own dimension table.

A dimension key such as a transaction number, receipt number or invoice number has no further associated attributes and hence cannot be designed as a dimension table.

What is a junk dimension?

A junk dimension is a grouping of typically low-cardinality attributes (flags, indicators etc.) so that they can be removed from other tables and "junked" into an abstract dimension table.

These junk dimension attributes might not be related. The only purpose of this table is to store all the combinations of the dimensional attributes that you could not otherwise fit into the other dimension tables. Junk dimensions are often used to implement rapidly changing dimensions in a data warehouse.
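
Storing "all the combinations" amounts to a Cartesian product of the flag values. A minimal sketch, assuming three made-up low-cardinality attributes (payment_type, is_gift, order_channel):

```python
from itertools import product

# Hypothetical low-cardinality attributes that have no dimension of their own.
flags = {
    "payment_type": ["cash", "card"],
    "is_gift": ["Y", "N"],
    "order_channel": ["web", "store", "phone"],
}

# One junk-dimension row per combination, each with its own surrogate key.
junk_dimension = [
    {"junk_key": key, **dict(zip(flags, combo))}
    for key, combo in enumerate(product(*flags.values()), start=1)
]

# Lookup used while loading facts: combination of flag values -> junk_key.
junk_key_for = {
    tuple(row[name] for name in flags): row["junk_key"] for row in junk_dimension
}

print(len(junk_dimension))                 # 2 * 2 * 3 combinations
print(junk_key_for[("cash", "Y", "web")])
```

The fact table then carries a single junk_key column instead of three separate flag columns.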

What is a role-playing dimension?

Dimensions are often reused for multiple purposes within the same database, with a different contextual meaning each time. For instance, a "Date" dimension can be used for "Date of Sale" as well as "Date of Delivery" or "Date of Hire". Such a dimension is often referred to as a 'role-playing dimension'.
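
In SQL, a role-playing dimension usually shows up as the same table joined more than once under different aliases. The sketch below uses Python's sqlite3 module; the tables dim_date and fact_orders and their rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, calendar_date TEXT);
    CREATE TABLE fact_orders (
        order_id          INTEGER,
        sale_date_key     INTEGER REFERENCES dim_date(date_key),
        delivery_date_key INTEGER REFERENCES dim_date(date_key)
    );
    INSERT INTO dim_date    VALUES (1, '2015-10-01'), (2, '2015-10-05');
    INSERT INTO fact_orders VALUES (100, 1, 2);
""")

# The single dim_date table plays two roles, one per alias.
row = conn.execute("""
    SELECT f.order_id, sale.calendar_date, delivery.calendar_date
    FROM fact_orders AS f
    JOIN dim_date AS sale     ON sale.date_key     = f.sale_date_key
    JOIN dim_date AS delivery ON delivery.date_key = f.delivery_date_key
""").fetchone()

print(row)
```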

What is SCD?

SCD stands for slowly changing dimension, i.e. a dimension in which the data changes slowly. These can be of many types, e.g. Type 0, Type 1, Type 2, Type 3 and Type 6, although Types 1, 2 and 3 are the most common.

What is rapidly changing dimension?

This is a dimension where the data changes rapidly. A mini dimension is one way to implement an RCD.

Describe different types of slowly changing Dimension (SCD)

Type 0:

A Type 0 dimension is one where dimensional changes are not considered. This does not mean that the attributes of the dimension do not change in the actual business situation. It just means that, even if the values of the attributes change, the change is not applied: no history is kept and the table goes on holding the data as originally loaded.

Type 1:

A type 1 dimension is one where history is not maintained and the table always shows the most recent data. This effectively means that such a dimension table is updated with the latest data whenever there is a change, and because of this update we lose the previous values.

Type 2:

A type 2 dimension table tracks the historical changes by creating separate rows in the table with different surrogate keys. Consider there is a customer C1 under group G1 first and later on the customer is changed to group G2. Then there will be two separate records in dimension table like below,

Key	Customer	Group	Start Date	End Date
1	C1	G1	1st Jan 2000	31st Dec 2005
2	C1	G2	1st Jan 2006	NULL

Note that separate surrogate keys are generated for the two records. NULL end date in the second row denotes that the record is the current record. Also note that, instead of start and end dates, one could also keep version number column (1, 2 … etc.) to denote different versions of the record.
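
The Type 2 behaviour shown in the table can be sketched in Python as "close the current row, then append a new one". The function name apply_type2_change and the dictionary layout are assumptions made for illustration, not part of any ETL tool.

```python
from datetime import date, timedelta


def apply_type2_change(dimension, customer, new_group, change_date):
    """Close the customer's current row and append a new version of it."""
    for row in dimension:
        if row["customer"] == customer and row["end_date"] is None:
            if row["group"] == new_group:
                return dimension  # attribute unchanged, nothing to do
            # End the old version the day before the change takes effect.
            row["end_date"] = change_date - timedelta(days=1)
    # Generate the next surrogate key and insert the new current row.
    next_key = max((row["key"] for row in dimension), default=0) + 1
    dimension.append({
        "key": next_key,
        "customer": customer,
        "group": new_group,
        "start_date": change_date,
        "end_date": None,  # None (NULL) marks the current record
    })
    return dimension


customer_dim = [{
    "key": 1, "customer": "C1", "group": "G1",
    "start_date": date(2000, 1, 1), "end_date": None,
}]
apply_type2_change(customer_dim, "C1", "G2", date(2006, 1, 1))

for row in customer_dim:
    print(row)
```

Running it reproduces the two rows of the table above: key 1 ends on 31st Dec 2005 and key 2 starts on 1st Jan 2006 with no end date.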

Type 3:

A type 3 dimension stores the history in a separate column instead of in separate rows. So unlike a type 2 dimension, which grows vertically, a type 3 dimension grows horizontally. See the example below,

Key	Customer	Previous Group	Current Group
1	C1	G1	G2

This is only good when you need not store many consecutive histories and when date of change is not required to be stored.
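
A small Python sketch shows why Type 3 keeps only a limited history. The helper apply_type3_change and the column names are hypothetical; the point is that each change shifts the current value into the "previous" column and overwrites whatever was there.

```python
def apply_type3_change(row, new_group):
    """Shift the current value into the 'previous' column, then overwrite."""
    if row["current_group"] != new_group:
        row["previous_group"] = row["current_group"]
        row["current_group"] = new_group
    return row


customer = {"key": 1, "customer": "C1", "previous_group": None, "current_group": "G1"}

apply_type3_change(customer, "G2")
after_first_change = dict(customer)   # previous = G1, current = G2

apply_type3_change(customer, "G3")
after_second_change = dict(customer)  # previous = G2, current = G3; G1 is lost

print(after_first_change)
print(after_second_change)
```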

Type 6:

A type 6 dimension is a hybrid of types 1, 2 and 3 (1+2+3). It acts very much like type 2, except that you add one extra column to denote which record is the current record.

Key	Customer	Group	Start Date	End Date	Current Flag
1	C1	G1	1st Jan 2000	31st Dec 2005	N
2	C1	G2	1st Jan 2006	NULL	Y

What is a mini dimension?

Mini dimensions can be used to handle the rapidly changing dimension scenario. If a dimension has a large number of rapidly changing attributes, it is better to separate those attributes into a different table, called a mini dimension. This is done because, if the main dimension table is designed as SCD type 2, the table will quickly grow very large and create performance issues. Segregating the rapidly changing attributes into a different table keeps the main dimension table small and performant.

What is a fact-less fact?

A fact table that does not contain any measure is called a fact-less fact. This table contains only keys from different dimension tables. It is often used to resolve a many-to-many cardinality issue.

Explanatory Note:

Consider a school, where a single student may be taught by many teachers and a single teacher may have many students. To model this situation in a dimensional model, one might introduce a fact-less fact table joining teacher and student keys. Such a fact table will then be able to answer queries like,

1. Who are the students taught by a specific teacher?

2. Which teacher teaches the most students?

3. Which student has the highest number of teachers?
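
The three queries above can be answered from nothing but pairs of keys. A minimal Python sketch, with made-up teacher and student keys:

```python
from collections import defaultdict

# Hypothetical fact-less fact rows: (teacher_key, student_key), no measures.
teaches = [("T1", "S1"), ("T1", "S2"), ("T2", "S1"), ("T1", "S3")]

students_by_teacher = defaultdict(set)
teachers_by_student = defaultdict(set)
for teacher, student in teaches:
    students_by_teacher[teacher].add(student)
    teachers_by_student[student].add(teacher)

# 1. Students taught by a specific teacher.
students_of_t1 = sorted(students_by_teacher["T1"])

# 2. Teacher with the most students.
busiest_teacher = max(students_by_teacher, key=lambda t: len(students_by_teacher[t]))

# 3. Student with the highest number of teachers.
most_taught_student = max(teachers_by_student, key=lambda s: len(teachers_by_student[s]))

print(students_of_t1, busiest_teacher, most_taught_student)
```

All three answers come from counting rows, which is why the table needs no measure column.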

What is a coverage fact?

A fact-less fact table can only answer 'optimistic' (positive) queries; it cannot answer a negative query. Consider again the school example above. A fact-less fact containing the keys of teachers and students cannot answer queries like the ones below,

1. Which teacher did not teach any student?

2. Which student was not taught by any teacher?

Why not? Because a fact-less fact table stores only the positive scenarios (like a student being taught by a teacher). If there is a student who is not being taught by any teacher, that student's key does not appear in the table, which reduces the coverage of the table.

A coverage fact table attempts to answer this, often by adding an extra flag column. Flag = 0 indicates a negative condition and flag = 1 indicates a positive condition. To understand this better, consider a class where there are 100 students and 5 teachers. The coverage fact table will ideally store 100 x 5 = 500 records (all combinations), and if a certain teacher is not teaching a certain student, the corresponding flag for that record will be 0.
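
The 100 x 5 example can be reproduced in a few lines of Python. The keys and the small set of "taught" pairs are invented for the sketch.

```python
from itertools import product

teachers = [f"T{i}" for i in range(1, 6)]    # 5 teachers
students = [f"S{i}" for i in range(1, 101)]  # 100 students

# Hypothetical positive facts, as a fact-less fact table would hold them.
taught = {("T1", "S1"), ("T1", "S2"), ("T2", "S1")}

# Coverage fact: every combination, flagged 1 (positive) or 0 (negative).
coverage = [
    {"teacher": t, "student": s, "flag": 1 if (t, s) in taught else 0}
    for t, s in product(teachers, students)
]

# Negative queries become answerable because every key is present.
teachers_without_students = [
    t for t in teachers
    if not any(r["flag"] for r in coverage if r["teacher"] == t)
]
students_without_teacher = [
    s for s in students
    if not any(r["flag"] for r in coverage if r["student"] == s)
]

print(len(coverage))
print(teachers_without_students)
print(len(students_without_teacher))
```

The cost is visible too: 500 rows are stored to describe three actual teaching relationships.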

What are incident and snapshot facts?

A fact table stores some kind of measurements. Usually these measurements are stored (or captured) against a specific time, and they vary with respect to time. It may happen that the business is not able to capture all of its measures for every point in time. The unavailable measurements can then either be kept empty (NULL) or be filled with the last available measurements. The first case is an example of an incident fact and the second is an example of a snapshot fact.
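
The difference between the two can be sketched as "leave the gaps" versus "carry the last value forward". The daily readings below are made-up values.

```python
# Hypothetical daily measurements; None marks a day with no captured value.
readings = [("Mon", 100), ("Tue", None), ("Wed", None), ("Thu", 120)]

# Incident fact: store what was captured and leave the gaps empty.
incident_fact = list(readings)

# Snapshot fact: fill each gap with the last available measurement.
snapshot_fact = []
last_value = None
for day, value in readings:
    if value is not None:
        last_value = value
    snapshot_fact.append((day, last_value))

print(incident_fact)
print(snapshot_fact)
```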

What is aggregation and what is the benefit of aggregation?

A data warehouse usually captures data at the same degree of detail as is available in the source. This "degree of detail" is termed granularity. But not all reporting requirements on that data warehouse need the same degree of detail.

To understand this, consider an example from the retail business. A certain retail chain has 500 shops across Europe. All the shops record detail-level transactions for the products they sell, and those data are captured in a data warehouse.

Each shop manager can access the data warehouse and see which products are sold by whom and in what quantity on any given date. Thus the data warehouse provides the shop managers with detail-level data that can be used for inventory management, trend prediction etc.

Now think about the CEO of that retail chain. He does not really care which salesperson in London sold the highest number of chopsticks or which shop is the best seller of 'brown bread'. All he is interested in is, perhaps, the percentage increase of his revenue margin across Europe, or maybe the year-on-year sales growth in Eastern Europe. Such data is aggregated in nature, because the sales of goods in East Europe are derived by summing up the individual sales data from each shop in East Europe.

Therefore, to support different levels of data warehouse users, data aggregation is needed.
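
Rolling shop-level transactions up to region level is a plain group-and-sum. A sketch in Python, with hypothetical shops, regions and amounts:

```python
from collections import defaultdict

# Hypothetical detail-level rows: (shop, region, sale_amount).
transactions = [
    ("London", "West Europe", 120.0),
    ("Paris", "West Europe", 80.0),
    ("Warsaw", "East Europe", 50.0),
    ("Istanbul", "East Europe", 70.0),
    ("London", "West Europe", 30.0),
]

# Shop-level view (what a shop manager looks at).
sales_by_shop = defaultdict(float)
# Region-level view (what the CEO looks at): a coarser granularity.
sales_by_region = defaultdict(float)

for shop, region, amount in transactions:
    sales_by_shop[shop] += amount
    sales_by_region[region] += amount

print(dict(sales_by_shop))
print(dict(sales_by_region))
```

In a real warehouse the region-level totals would typically be precomputed into an aggregate table so that executive reports do not rescan every transaction.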

What is slicing-dicing?

Slicing means showing a slice of the data for a given dimension (e.g. Product), a value of that dimension (e.g. Brown Bread) and a measure (e.g. sales).

Dicing means viewing the slice with respect to different dimensions and at different levels of aggregation.

Slicing and dicing operations are part of pivoting.
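
Following the definitions above, a slice is a filter on one dimension value and a dice is a regrouping of that slice by other dimensions. The helper names slice_cube and dice, and the tiny data set, are assumptions made for this sketch.

```python
from collections import defaultdict

# Hypothetical cube cells: dimensions product, region, year; measure sales.
cube = [
    {"product": "Brown Bread", "region": "East", "year": 2014, "sales": 10.0},
    {"product": "Brown Bread", "region": "East", "year": 2015, "sales": 12.0},
    {"product": "Brown Bread", "region": "West", "year": 2015, "sales": 7.0},
    {"product": "White Bread", "region": "East", "year": 2015, "sales": 99.0},
]


def slice_cube(rows, dimension, value):
    """Keep only the cells where one dimension has the given value."""
    return [row for row in rows if row[dimension] == value]


def dice(rows, *dimensions):
    """Total the sales measure of the rows, grouped by the chosen dimensions."""
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[d] for d in dimensions)] += row["sales"]
    return dict(totals)


brown_bread = slice_cube(cube, "product", "Brown Bread")
by_region = dice(brown_bread, "region")
by_region_and_year = dice(brown_bread, "region", "year")

print(by_region)
print(by_region_and_year)
```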

What is drill-through?

Drill-through is the process of going from summary data to the detail-level data.

Consider the above example on retail shops. If the CEO finds out that sales in East Europe have declined this year compared to last year, he might want to know the root cause of the decrease. For this, he may start drilling through his report to a more detailed level and eventually find out that, even though individual shop sales have actually increased, the overall sales figure has decreased because a certain shop in Turkey has stopped operating. The detail-level data, which the CEO was not much interested in earlier, has this time helped him pinpoint the root cause of the declining sales. The method he followed to obtain the details from the aggregated data is called drill-through.
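
The CEO's investigation can be mimicked in Python: start from yearly totals, then drill through to the shop-level rows behind them. The shops and figures are invented to match the story, and drill_through is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical detail rows for East Europe: (year, shop, sales).
sales = [
    (2014, "Warsaw", 100.0), (2014, "Prague", 100.0), (2014, "Istanbul", 150.0),
    (2015, "Warsaw", 110.0), (2015, "Prague", 120.0),
]

# Summary level: what the CEO's report shows first.
summary = defaultdict(float)
for year, _, amount in sales:
    summary[year] += amount


def drill_through(year):
    """Return the shop-level detail behind one summary figure."""
    return {shop: amount for y, shop, amount in sales if y == year}


detail_2014 = drill_through(2014)
detail_2015 = drill_through(2015)

# Shops present last year but missing this year explain the drop.
closed_shops = set(detail_2014) - set(detail_2015)

print(dict(summary))
print(detail_2014, detail_2015)
print(closed_shops)
```

The totals fall from 350 to 230 even though Warsaw and Prague both grew; only the detail rows reveal the missing shop.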


22 Oct 2015
