
What is a Data Warehouse?

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates the analysis workload from the transaction workload and enables an organization to consolidate data from several sources.

What are data marts?

A data mart is a simple form of a data warehouse that is focused on a single subject (or functional area), such as Sales, Finance, or Marketing. Data marts are often built and controlled by a single department within an organization. Given their single-subject focus, data marts usually draw data from only a few sources. The sources could be internal operational systems, a central data warehouse, or external data.

What is a star schema?

A star schema can be depicted as a simple star: a central table contains the fact data, and multiple dimension tables radiate out from it, connected by the primary and foreign keys of the database. In a star schema implementation, Warehouse Builder stores the dimension data in a single table or view for all the dimension levels.

What is Dimensional Modeling?

This is the process of structuring and organizing data. The resulting data structures are then typically implemented in a database management system. In addition to defining and organizing the data, dimensional modeling may also impose constraints or limitations on the data placed within the structure.

What is a Snowflake Schema?

The snowflake schema is a dimensional model that is also composed of a central fact table and a set of dimension tables, but the dimension tables are further normalized into sub-dimension tables. In a snowflake schema implementation, Warehouse Builder uses more than one table or view to store the dimension data: separate tables or views store the data pertaining to each level in the dimension.

What are the different methods of loading dimension tables?

The data in dimension tables may change over time. Depending on how you want to treat the historical data, there are three ways of loading the slowly changing dimensions:

- Type 1 dimension: do not keep history. Update the record if it exists; otherwise insert it.
- Type 2 dimension: do not update the existing record. Create a new record of the dimension (with a version number or change date as part of the key) while retaining the old one.
- Type 3 dimension: keep an extra column for each changing attribute. The new value of the attribute is recorded in the existing record, in a previously empty column.

There are also two loading modes (applicable only to Oracle):

- Conventional load: before loading the data, all the table constraints are checked against the data.
- Direct (faster) load: all constraints are disabled and the data is loaded directly; the data is checked against the table constraints afterwards, and bad data is not indexed.

What are aggregate tables?

Aggregate tables, also known as summary tables, are fact tables that contain data summarized up to a different (coarser) level of detail.
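The Type 1 and Type 2 loading methods above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not a production loader; the table, column, and customer values are invented for the example.

```python
# A tiny in-memory "dimension table": a list of row dicts.
dim_customer = [
    {"sk": 1, "customer_id": "C001", "city": "Austin",
     "version": 1, "current": True},
]

def load_scd_type1(dim, customer_id, new_city):
    """Type 1: overwrite in place, so history is lost."""
    for row in dim:
        if row["customer_id"] == customer_id:
            row["city"] = new_city
            return
    dim.append({"sk": len(dim) + 1, "customer_id": customer_id,
                "city": new_city, "version": 1, "current": True})

def load_scd_type2(dim, customer_id, new_city):
    """Type 2: close the current row and insert a new versioned row."""
    for row in dim:
        if row["customer_id"] == customer_id and row["current"]:
            if row["city"] == new_city:
                return                      # nothing changed
            row["current"] = False          # retain the old record
            new_version = row["version"] + 1
            break
    else:
        new_version = 1                     # brand-new dimension member
    dim.append({"sk": len(dim) + 1, "customer_id": customer_id,
                "city": new_city, "version": new_version, "current": True})

load_scd_type2(dim_customer, "C001", "Dallas")
# The dimension now holds both the historical row and the current row.
```

In a real warehouse the same logic is expressed as an UPDATE-else-INSERT (Type 1) or an UPDATE-plus-INSERT (Type 2) against the dimension table.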

What is the difference between OLAP and OLTP?

Online transaction processing (OLTP) is designed to efficiently process high volumes of transactions, instantly recording business events (such as a sales invoice payment) and reflecting changes as they occur. Online analytical processing (OLAP) is designed for analysis and decision support, allowing exploration of often-hidden relationships in large amounts of data by providing unlimited views of multiple relationships at any cross-section of defined business dimensions.

Online Transaction Processing (OLTP):
- Application oriented; targets a specific process, such as ordering from an online store
- Up to date and consistent at all times
- Detailed, isolated data with no redundancy
- Queries touch small amounts of data; few records accessed at a time
- Fast response time; performance sensitive
- Updates are frequent; concurrency is the biggest performance concern
- Read/update access
- Clerical users
- Database size is usually around 100 MB to 100 GB

Online Analytical Processing (OLAP):
- Used to analyze and forecast business needs
- Data is consistent only up to the last update
- Summarized, integrated data; redundancy cannot be avoided
- Integrates data from different processes (ordering, processing, inventory, sales, etc.)
- Queries touch large amounts of data; large volumes accessed at a time
- Slower response time; performance requirements are relaxed
- Updates are less frequent; mostly read access with occasional updates
- Each report or query requires a lot of resources
- Managerial/business users
- Database size is usually around 100 GB to a few TB

What is ETL?

Extract, transform, and load (ETL) is a process in database usage, and especially in data warehousing, that involves:
- Extracting data from outside sources
- Transforming it to fit operational needs (which can include quality levels)
- Loading it into the end target (database or data warehouse)

What are the various ETL tools in the market?

- Oracle Warehouse Builder (OWB) 11gR1 (Oracle)
- Data Integrator & Services XI 3.0 (Business Objects, SAP)
- IBM Information Server (DataStage) 8.1 (IBM)
- SAS Data Integration Studio 4.2 (SAS Institute)
- PowerCenter 8.5.1 (Informatica)
- Elixir Repertoire 7.2.2 (Elixir)
- Data Migrator 7.6 (Information Builders)
- SQL Server Integration Services 10 (Microsoft)
- Talend Open Studio 3.1 (Talend)
- DataFlow Manager 6.5 (Pitney Bowes Business Insight)

What are various reporting tools in the market?

SSRS (Microsoft), BusinessObjects, Pentaho Reporting, BIRT, Cognos, MicroStrategy, Actuate, QlikView, ProClarity, Excel, Crystal Reports.

Further ETL/data-integration tools include:
- Data Integrator 8.12 (Pervasive)
- Transformation Server 5.4 (IBM DataMirror)
- Transformation Manager 5.2.2 (ETL Solutions Ltd.)
- Data Manager/Decision Stream 8.2 (IBM Cognos)
- Clover ETL 2.5.2 (Javlin)
- ETL4ALL 4.2 (IKAN)
- DB2 Warehouse Edition 9.1 (IBM)
- Pentaho Data Integration 3.0 (Pentaho)
- Adeptia Integration Server 4.9 (Adeptia)

What is a Fact table?

A fact table is a table, typically in a data warehouse, that contains the measures and facts (the primary data). A fact table typically has two types of columns: those that contain numeric facts (often called measurements), and those that are foreign keys to dimension tables. A fact table contains either detail-level facts or facts that have been aggregated; fact tables that contain aggregated facts are often called summary tables. A fact table usually contains facts with the same level of aggregation.
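The two column types of a fact table can be illustrated with a tiny star-schema fragment. This is a hypothetical sketch; the table contents and surrogate-key values are invented.

```python
# A dimension table mapping surrogate keys to descriptive attributes.
dim_product = {1: "Widget", 2: "Gadget"}          # product_sk -> name

# Fact rows: foreign keys into dimensions plus numeric measures.
fact_sales = [
    # (date_sk, product_sk, quantity, amount)
    (20240101, 1, 3, 30.0),
    (20240101, 2, 1, 25.0),
    (20240102, 1, 2, 20.0),
]

# A typical warehouse query: aggregate a measure by a dimension attribute.
totals = {}
for date_sk, product_sk, qty, amount in fact_sales:
    name = dim_product[product_sk]                # FK lookup (the "join")
    totals[name] = totals.get(name, 0.0) + amount

print(totals)   # {'Widget': 50.0, 'Gadget': 25.0}
```

The foreign-key columns constrain and group the query; the numeric columns are what actually gets summed.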

What is a Dimension table? Dimension tables, also known as lookup or reference tables, contain the relatively static data in the warehouse. Dimension tables store the information you normally use to constrain queries. They are usually textual and descriptive, and you can use them as the row headers of the result set. Examples are customers or products.

What is a look-up table?

A look-up table is a referential table: we pass a key column from the source table, and once the key column matches, we get the required data back.

What are the modeling tools available in the market? Name some of them.

- ERwin (Computer Associates)
- Embarcadero (Embarcadero Technologies)
- Rational Rose (IBM Corporation)
- PowerDesigner (Sybase Corporation)
- Oracle Designer (Oracle Corporation)

What is normalization? First normal form, second normal form, third normal form?

Normalization is a series of steps followed to obtain a database design that allows for efficient access and storage of data. These steps reduce data redundancy and the chances of data becoming inconsistent.

First Normal Form: First Normal Form eliminates repeating groups by putting each into a separate table and connecting them with a one-to-many relationship. Two rules follow from this definition:

- Each table has a primary key, made of one or several fields, that uniquely identifies each record.
- Each field is atomic: it does not contain more than one value.

Second Normal Form: Second Normal Form eliminates functional dependencies on a partial key by putting the fields in a separate table from those that are dependent on the whole key. In our example, "wagon_type", "empty_weight", and "capacity" depend only on "wagon_id" and not on the "timestamp" field of the primary key, so this table is not in 2NF. To reach 2NF, we have to split the table in two so that each field of each table depends on all the fields of its primary key.

Third Normal Form: Third Normal Form eliminates functional dependencies on non-key fields by putting them in a separate table. At this stage, all non-key fields are dependent on the key, the whole key, and nothing but the key. In our example, "empty_weight", "capacity", "designer", and "design_date" most likely depend on "wagon_type", so we have to split this table in two.
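The 2NF split described above can be sketched with the wagon example. The sample rows and load values are invented for illustration.

```python
# Denormalized rows: (wagon_id, timestamp) is the key, but wagon_type,
# empty_weight, and capacity depend on wagon_id alone -> violates 2NF.
readings = [
    # (wagon_id, timestamp, wagon_type, empty_weight, capacity, load)
    ("W1", "2024-01-01T08:00", "hopper", 20.0, 100.0, 55.0),
    ("W1", "2024-01-01T12:00", "hopper", 20.0, 100.0, 80.0),
    ("W2", "2024-01-01T08:00", "tanker", 25.0, 120.0, 60.0),
]

# Split into two tables so every non-key field depends on the whole key.
wagons = {}          # wagon_id -> (wagon_type, empty_weight, capacity)
loads = []           # (wagon_id, timestamp, load): full-key dependency
for wagon_id, ts, wtype, empty_w, cap, load in readings:
    wagons[wagon_id] = (wtype, empty_w, cap)
    loads.append((wagon_id, ts, load))

# For 3NF, attributes that depend on wagon_type rather than wagon_id
# (e.g. empty_weight, capacity) would move into a third table keyed
# by wagon_type.
```

Note how the repeated wagon attributes collapse into a single row per wagon, removing the redundancy that 2NF targets.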

What is ODS? An operational data store (ODS) is a database designed to integrate data from multiple sources for additional operations on the data. The data is then passed back to operational systems for further operations, and to the data warehouse for reporting.

What type of indexing mechanism do we need to use for a typical data warehouse? On the fact table it is best to use bitmap indexes. Dimension tables can use bitmap indexes and/or the other index types: clustered/non-clustered, unique/non-unique.
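A bitmap index works by keeping one bitmap (one bit per row) per distinct value of a low-cardinality column, so a WHERE clause with several equality predicates reduces to bitwise ANDs. A minimal sketch, with invented column values:

```python
# One bit per row, one bitmap per distinct value of the column.
gender = ["M", "F", "M", "M", "F"]        # low-cardinality fact column

gender_bm = {}
for pos, value in enumerate(gender):
    gender_bm[value] = gender_bm.get(value, 0) | (1 << pos)

region = ["N", "N", "S", "N", "S"]        # another low-cardinality column
region_bm = {}
for pos, value in enumerate(region):
    region_bm[value] = region_bm.get(value, 0) | (1 << pos)

# WHERE gender = 'M' AND region = 'N' becomes a single bitwise AND.
hit_bits = gender_bm["M"] & region_bm["N"]
matching_rows = [pos for pos in range(len(gender)) if hit_bits & (1 << pos)]
print(matching_rows)   # [0, 3]
```

This is why bitmap indexes suit fact tables: the bitmaps are tiny for low-cardinality columns and combining predicates is cheap.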

Which columns go to the fact table and which columns go to the dimension table? The changing numeric fields (the measures) go to the fact table; the textual, descriptive attributes go to the dimension tables.

What is the level of granularity of a fact table? What does this signify? Granularity is the level of detail at which measures and metrics are represented. The lowest level is called detailed data and the highest level is called summary data. The granularity chosen for the fact table depends on the requirements of the project.

How are the dimension tables designed? Dimension tables are typically de-normalized, wide, and short; they use surrogate keys and contain additional date fields and flags.

What are slowly changing dimensions? Slowly changing dimensions are dimensions in which the data changes slowly, rather than changing regularly on a time basis.

What are non-additive facts? Facts are generally additive, but some business facts are non-additive, such as inventory levels or bank account balances.

What are conformed dimensions? A conformed dimension is a set of data attributes that have been physically implemented in multiple database tables using the same structure, attributes, domain values, definitions and concepts in each implementation.

What are SCD1, SCD2, and SCD3? There are three types of SCDs and you can use Warehouse Builder to define, deploy, and load all three types of SCDs.

Type 1 SCDs - Overwriting In a Type 1 SCD the new data overwrites the existing data. Thus the existing data is lost as it is not stored anywhere else. This is the default type of dimension you create. You do not need to specify any additional information to create a Type 1 SCD.

Type 2 SCDs - Creating another dimension record A Type 2 SCD retains the full history of values. When the value of a chosen attribute changes, the current record is closed. A new record is created with the changed data values and this new record becomes the current record. Each record contains the effective time and expiration time to identify the time period between which the record was active.

Type 3 SCDs - Creating a current value field A Type 3 SCD stores two versions of values for certain selected level attributes. Each record stores the previous value and the current value of the selected attribute. When the value of any of the selected attributes changes, the current value is stored as the old value and the new value becomes the current value.
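The Type 3 behavior, keeping a previous-value column alongside the current one, can be sketched as follows. The column and customer names are invented for the example.

```python
# Type 3 sketch: the dimension row keeps both the previous and the
# current value of the tracked attribute.
row = {"customer_id": "C001",
       "current_city": "Austin",
       "previous_city": None}

def apply_type3_change(record, new_city):
    """Shift the current value into the 'previous' column, then store
    the new value as current. Only one generation of history is kept."""
    if record["current_city"] != new_city:
        record["previous_city"] = record["current_city"]
        record["current_city"] = new_city

apply_type3_change(row, "Dallas")
print(row["previous_city"], row["current_city"])   # Austin Dallas
```

Unlike Type 2, no new row is created: a second change would overwrite "Austin", so Type 3 suits attributes where only the immediately preceding value matters.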

Discuss the advantages and disadvantages of star and snowflake schemas.

Star schema advantages: fewer joins, faster query operation.
Star schema disadvantages: bigger table sizes; too many rows in the fact table.

Snowflake schema advantages: distributed data; easier to obtain fact-less data (e.g., orders shipped across one quarter).
Snowflake schema disadvantages: more joins, slower query operation.

In a star schema every dimension has a primary key and a dimension table does not have any parent table, whereas in a snowflake schema a dimension table has one or more parent tables.

In a star schema, hierarchies for the dimensions are stored in the dimension table itself, whereas in a snowflake schema the hierarchies are broken out into separate tables. These hierarchies help to drill down the data from the topmost level to the lowermost level.

What is a junk dimension?

A junk dimension is an abstract dimension holding the decodes for a group of low-cardinality flags and indicators, thereby removing those flags from the fact table.

What are the differences between a view and a materialized view?

View:
- Stores only the SQL statement in the database and lets you use it as a table; every time you access the view, the SQL statement executes.
- The query result is not stored on disk: it is only a logical view of the table, with no separate copy of the data.
- The rowids of a view are the same as those of the original table.
- You always get the latest data from the database; no refresh is required.
- Performance is lower than that of a materialized view.

Materialized view:
- Stores the result of the SQL query in table form in the database; the SQL statement executes once, and afterwards every query uses the stored result set, giving quick query results.
- The query result is stored on disk as a physically separate copy of the table.
- The rowids are different from those of the original table.
- You need to refresh the materialized view to get the latest data, using a trigger or some automatic refresh mechanism.
- Performance is better than that of a view.
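The view-versus-materialized-view distinction boils down to "re-run the query every time" versus "cache the result and refresh explicitly". A minimal sketch with an invented orders table:

```python
orders = [("A", 10.0), ("B", 5.0), ("A", 7.5)]   # base table

def view_totals():
    """A plain view: the 'query' re-executes on every access,
    so it always reflects the current base table."""
    totals = {}
    for cust, amt in orders:
        totals[cust] = totals.get(cust, 0.0) + amt
    return totals

# A materialized view: run the query once and store the result.
mv_totals = view_totals()

orders.append(("B", 2.5))        # base table changes
stale = mv_totals                # still the old snapshot until refreshed
mv_totals = view_totals()        # the "REFRESH MATERIALIZED VIEW" step
```

Here `stale` shows the materialized result falling behind the base table, which is exactly the trade-off: fast reads in exchange for an explicit or scheduled refresh.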

Compare the data warehousing top-down and bottom-up approaches.

Top-down approach:
- The data warehouse is built first; the data marts are then created from the data warehouse.
- Provides consistent dimensional views of data across data marts, as all data marts are loaded from the data warehouse.
- This approach is robust against business changes, and creating a new data mart from the data warehouse is very easy.
- This methodology is inflexible to changing departmental needs during the implementation phase.
- It represents a very large project, and the cost of implementing it is significant.

Bottom-up approach:
- The data marts are created first to provide reporting capability; they are then integrated to build a complete data warehouse. The positions of the data warehouse and the data marts are reversed relative to the top-down design.
- This model contains consistent data marts, and these data marts can be delivered quickly.
- As the data marts are created first, reports can be generated quickly.
- The data warehouse can be extended easily to accommodate new business units: it is just a matter of creating new data marts and integrating them with the existing ones.

What is a factless fact schema?

A factless fact table is a fact table without measures. It can be used to count occurring events; for example, the number of accidents that occurred in a month.

Which kind of index is preferred in a DWH?

The index type depends very much on the cardinality of the distinct values. High cardinality requires a regular B-tree index, whereas very low cardinality calls for bitmap indexes. Small tables may not require indexes at all, since a full-table scan on such a table can be much faster than reading an index. It actually depends on the nature of the column on which you are going to create the index: use a bitmap index if the column is a flag (contains only values such as 1 or 0), and a B-tree index if the column contains many distinct numerical values. Partitions can also be created if the column contains only a short list of values.

What is the architecture of a data warehousing project? What is the flow?

1) Data warehousing starts with data modeling, i.e., the creation of dimensions and facts.
2) Data is collected from source systems such as OLTP, CRM, and ERP applications.
3) Cleansing and transformation are performed with an ETL (Extraction, Transformation, Loading) tool.
4) By the end of the ETL process, the target tables (dimensions and facts) are loaded with data that satisfies the business rules.
5) Finally, using reporting (OLAP) tools, we can extract the information used for decision support.

Explain additive, semi-additive, and non-additive facts.

A fact table can store different types of measures: additive, semi-additive, and non-additive.

- Additive: as the name implies, additive measures can be added across all dimensions.
- Semi-additive: semi-additive measures can be added across only some dimensions, and not across others.
- Non-additive: unlike additive measures, non-additive measures cannot be added across any dimension.

Difference between a DWH and an ODS

ODS:
- Transactions similar to those of an online transaction processing system
- Contains current and near-current data
- Typically detailed data only, often resulting in very large data volumes
- Real-time and near-real-time data loads
- Generally modeled to support rapid data update; updated at the data-field level
- Used for detailed decision making and operational reporting
- Audience: knowledge workers (customer service representatives, line managers)
- Data is volatile

DWH:
- Transactions similar to those of an online analytical system; queries process larger volumes of data
- Contains historical data
- Typically batch data loads
- Generally dimensionally modeled and tuned to optimize query performance
- Data is appended, not updated
- Used for long-term decision making and management reporting
- Audience: strategic (executives, business unit management)
- Data is non-volatile

What are the steps to build a data warehouse?

- Extract the transactional data from the data sources into a staging area
- Transform the transactional data
- Load the transformed data into a dimensional database
- Build pre-calculated summary values to speed up report generation
- Build (or purchase) a front-end reporting tool

Or, from a modeling perspective:

- Identify sources
- Identify facts
- Define dimensions
- Define attributes
- Redefine dimensions and attributes
- Organize the attribute hierarchy and define relationships
- Assign unique identifiers
- Additional conventions: cardinality, adding ratios

Or, at a high level:

1. Business modeling
2. Data modeling
3. Data from the source databases
4. Extraction, Transformation, Loading
5. Data warehouse (data marts)

How do you connect two fact tables? Is it possible?

Yes, through the conformed dimension methodology: a dimension table connected to more than one fact table is called a conformed dimension.

What is the main difference between the Inmon and Kimball philosophies of data warehousing?

Bill Inmon's paradigm: the data warehouse is one part of the overall business intelligence system. An enterprise has one data warehouse, and data marts source their information from the data warehouse. In the data warehouse, information is stored in third normal form.

Ralph Kimball's paradigm: the data warehouse is the conglomerate of all data marts within the enterprise. Information is always stored in the dimensional model.

What is meant by metadata in the context of a data warehouse, and why is it important?

Metadata is data about data. Examples of metadata include data element descriptions, data type descriptions, attribute/property descriptions, range/domain descriptions, and process/method descriptions. The repository environment encompasses all corporate metadata resources: database catalogs, data dictionaries, and navigation services. Metadata includes things like the name, length, valid values, and description of a data element. Metadata is stored in a data dictionary and repository. It insulates the data warehouse from changes in the schema of operational systems.

Metadata synchronization is the process of consolidating, relating, and synchronizing data elements with the same or similar meaning from different systems. It joins these differing elements together in the data warehouse to allow for easier access.

What is the role of surrogate keys in a data warehouse, and how will you generate them?

A surrogate key is a simple primary key that maps one-to-one to a natural compound primary key. The reasons for using surrogate keys are to relieve the query writer of the need to know the full compound key, and to speed query processing by removing the need for the RDBMS to process the full compound key when considering a join. The surrogate key links the dimension and fact tables, and it avoids smart keys and production keys.

How is data stored in the data warehouse after it has been extracted and transformed from heterogeneous sources? Why is the fact table in normal form?

The foreign keys of the fact table are the primary keys of the dimension tables. The fact table consists of the index keys of the dimension/lookup tables plus the measures; because it contains columns that are primary keys of other tables, the fact table is by construction in normal form.

What is the difference between E-R modeling and dimensional modeling?

The basic difference is that E-R modeling has both a logical and a physical model, while dimensional modeling has only a physical model. E-R modeling is used for normalizing the OLTP database design; dimensional modeling is used for de-normalizing the ROLAP/MOLAP design.

Can a dimension table contain numeric values?

Yes, a dimension can have numeric values; for example, the surrogate key, which holds a numeric value for unique identification of the records in the dimension. The data type, however, may still be char (the values themselves can be numeric or char).

What are the methodologies of data warehousing?

Every company has a methodology of its own, but to name a few, the SDLC and AIM methodologies are commonly used; others are AMM, World Class methodology, and many more. Regarding data warehouse models, there are mainly two: the Ralph Kimball model, which is structured as a de-normalized structure, and the Inmon model, which is structured as a normalized structure. Depending on its requirements, a company chooses one of these models for its DWH.

What would be the size of your warehouse project?

It depends on the requirements of the company.

What is a surrogate key? Where do we use it? Explain with examples.

A surrogate key is a unique identifier in the database, either for an entity in the modeled world or for an object in the database. Application data is not used to derive a surrogate key; it is generated internally by the current system and is invisible to the user. Because several objects can correspond to a surrogate in the database, a surrogate key cannot always be used as the primary key. For example, a sequential number can be a surrogate key.
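Surrogate-key generation can be sketched as a lookup table that hands out sequential integers; in practice this is a database sequence or identity column. The natural-key values below are invented for the example.

```python
# Map each natural (possibly compound) key to a small sequential integer
# the first time it is seen.
key_map = {}                 # natural key -> surrogate key
next_sk = 1

def surrogate_key(natural_key):
    """Return the surrogate for a natural key, allocating one if new."""
    global next_sk
    if natural_key not in key_map:
        key_map[natural_key] = next_sk
        next_sk += 1
    return key_map[natural_key]

# A compound natural key (source system + account number) maps to one
# integer, so fact-table joins use a single small key column.
sk1 = surrogate_key(("SAP", "ACC-001"))
sk2 = surrogate_key(("SAP", "ACC-001"))   # same natural key -> same sk
sk3 = surrogate_key(("ORA", "ACC-002"))   # new natural key -> new sk
print(sk1, sk2, sk3)   # 1 1 2
```

During dimension loading, the `key_map` role is played by a lookup against the dimension table itself; the fact loader then stores only the surrogate.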

What are semi-additive and fully additive measures?

Semi-additive: a semi-additive measure can be aggregated along some, but not all, of the dimensions included in the measure group that contains it. For example, a measure that represents the quantity available in inventory can be aggregated along a geography dimension to produce a total quantity available across all warehouses, but it cannot be aggregated along a time dimension, because it represents a periodic snapshot of the quantities available; aggregating such a measure along time would produce incorrect results.

Non-additive: a non-additive measure cannot be aggregated along any dimension in the measure group that contains it. Instead, the measure must be calculated individually for each cell in the cube that represents it. For example, a calculated measure that returns a percentage, such as profit margin, cannot be aggregated from the percentage values of child members in any dimension.
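The inventory example above can be made concrete with a few snapshot rows (the warehouses, dates, and quantities are invented):

```python
# Inventory snapshots: quantity on hand per warehouse per day.
snapshots = [
    # (day, warehouse, qty_on_hand)
    ("2024-01-01", "NYC", 100),
    ("2024-01-01", "LA",   50),
    ("2024-01-02", "NYC",  90),
    ("2024-01-02", "LA",   60),
]

# Additive across the warehouse (geography) dimension: summing is valid.
total_per_day = {}
for day, wh, qty in snapshots:
    total_per_day[day] = total_per_day.get(day, 0) + qty
print(total_per_day)            # {'2024-01-01': 150, '2024-01-02': 150}

# NOT additive across time: summing the daily totals gives 300, a
# meaningless number. A valid time aggregation takes the last (or an
# average) snapshot instead.
last_day = max(total_per_day)
print(total_per_day[last_day])  # 150
```

The same 150 units would be double-counted if the time dimension were summed, which is exactly why this measure is only semi-additive.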

What are the differences between a star schema and a snowflake schema?

Star schema:
- Highly de-normalized
- Lower data-access latency
- The size of the DWH is larger than with a snowflake schema, as the data is de-normalized; requires more storage space
- Good from a performance perspective: it reduces the number of joins between tables

Snowflake schema:
- Normalized
- Higher data-access latency compared to a star schema
- The size of the DWH is smaller than with a star schema: minimum storage space, minimum data redundancy
- Better when memory utilization is a major concern
- Requires more joins to get information from the lookup tables, hence slower performance

Where do we use a star schema and where a snowflake schema? If performance is the priority, go for a star schema, since its dimension tables are de-normalized. If memory space is the priority, go for a snowflake schema, since its dimension tables are normalized.

What is an ODS? What data is loaded from it? What is the DW architecture?

ODS: Operational Data Store, normally in 3NF form; data is stored with the least redundancy. The general DWH architecture is: OLTP system -> ODS -> DWH (de-normalized star or snowflake, varying case by case).
