
Informatica B2B Data Transformation is high-performance software that converts structured and unstructured data to and from more broadly consumable data formats to support business-to-business and multi-enterprise transactions. This single, unified, codeless environment supports virtually any-to-any data transformation and is accessible to multiple business levels within the organization: analysts, developers, and programmers.

B2B lets you parse and read unstructured data such as PDF, Excel, and HTML files, and it can also read binary data such as messages and EBCDIC files; the list of supported formats is very large.

B2B Data Transformation Studio is the developer tool in which the parsing (reading) of the unstructured data is defined. B2B typically produces its output as an XML file.

B2B Data Transformation is integrated with Informatica PowerCenter through the "Unstructured Data Transformation". This transformation can receive the output of B2B Data Transformation Studio and load it into any target supported by PowerCenter.

++++++++++++++++++++++++++++++++++++++++++++

Informatica parallelism: there are different types of Informatica partitions, e.g.:

1. Database partitioning
2. Hash auto-keys
3. Hash user keys
4. Key range
5. Pass-through
6. Round-robin
Database partitioning:
The Integration Service queries the database system for table partition
information. It reads partitioned data from the corresponding nodes in the
database.
Pass-through:
The Integration Service processes data without redistributing rows among
partitions. All rows in a single partition stay in the partition after crossing a
pass-through partition point. Choose pass-through partitioning when we want to
create an additional pipeline stage to improve performance, but do not want to
change the distribution of data across partitions.
Round-robin:
The Integration Service distributes data evenly among all partitions. Use round-
robin partitioning when we want each partition to process approximately the
same number of rows, i.e., for load balancing.
Hash auto-keys:
The Integration Service uses a hash function to group rows of data among
partitions. The Integration Service groups the data based on a partition key. The
Integration Service uses all grouped or sorted ports as a compound partition
key. We may need to use hash auto-keys partitioning at Rank, Sorter, and
unsorted Aggregator transformations.
Hash user keys:
The Integration Service uses a hash function to group rows of data among
partitions. We define the number of ports to generate the partition key.
Key range:
The Integration Service distributes rows of data based on a port or set of ports
that we define as the partition key. For each port, we define a range of values.
The Integration Service uses the key and ranges to send rows to the appropriate
partition. Use key range partitioning when the sources or targets in the pipeline
are partitioned by key range.
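As a rough illustration of how the key range (or database partitioning) case above lines up with physical table partitioning, an Oracle target might be range-partitioned like this; the table, column, and partition names are hypothetical:

```sql
-- Hypothetical Oracle table range-partitioned on order_date; session key range
-- partitions (or database partitioning) can then mirror these boundaries.
CREATE TABLE orders_fact (
    order_id    NUMBER       NOT NULL,
    order_date  DATE         NOT NULL,
    amount      NUMBER(12,2)
)
PARTITION BY RANGE (order_date) (
    PARTITION p_2011_h1 VALUES LESS THAN (DATE '2011-07-01'),
    PARTITION p_2011_h2 VALUES LESS THAN (DATE '2012-01-01'),
    PARTITION p_other   VALUES LESS THAN (MAXVALUE)
);
```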
++++++++++++++++++++++++++++++++++++++++++++

An Oracle relational database system is designed to take advantage of parallel architecture. The database is a multi-process system on UNIX and a multi-threaded application on Windows. In general, databases are accessed by a large number of concurrent users or connections. Many of these users, with their own data and instructions, take advantage of the availability of multiple processors to perform database processing. Also, a single user task, such as a SQL query, can be parallelized to achieve higher speed and throughput by using multiple processors.

The relational model consists of structured tables with rows and columns. Usually, the SQL
query aims at extracting or updating target data, which is a set of rows and columns based
on a given condition. Typically, any SQL database operation gets divided into multiple
database sub-operations such as SELECTION, JOIN, GROUP, SORT, PROJECTION, etc. Thus,
the sub-operations become excellent candidates for simultaneous or parallel execution. This
makes the RDBMS system ideal for the implementation of parallel processing software.

Databases have a component called the query optimizer that selects a sequence of inputs,
joins, and scans to produce the desired output table or data set. The query optimizer is
aware of the underlying hardware architecture and finds a suitable parallel execution path.
Hence, from the database perspective, parallel execution is useful for many types of
operations that access significant amounts of data.

Generally, parallel execution improves performance for:

* Queries.

* Creation of large indexes.

* Bulk INSERTs, UPDATEs, and DELETEs.
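For example, in Oracle each of these operations can be parallelized with hints or session settings; this is only a hedged sketch, and the table and index names below are hypothetical:

```sql
-- Parallel query: request a degree of parallelism of 4 for the scan of sales
SELECT /*+ PARALLEL(s, 4) */ product_id, SUM(amount)
FROM   sales s
GROUP BY product_id;

-- Parallel creation of a large index
CREATE INDEX sales_prod_idx ON sales (product_id) PARALLEL 4;

-- Parallel DML must be enabled in the session before bulk INSERT/UPDATE/DELETE
ALTER SESSION ENABLE PARALLEL DML;
INSERT /*+ PARALLEL(sales_hist, 4) */ INTO sales_hist
SELECT * FROM sales WHERE sale_date < DATE '2011-01-01';
COMMIT;
```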

Parallel Execution Mechanism

A SQL statement is executed in parallel using multiple parallel processes. The user process
acts as the parallel execution coordinator (PEC); it dispatches the statement to several
parallel execution servers and coordinates the end results. The results from all of the server
processes are sent back to the user. The basic unit of work in parallelism is called a granule.
Oracle divides the operation being parallelized, such as a table scan, table update, or index
creation, into granules. Parallel processes execute the operation one granule at a time.

Granules for Parallelism

There are two types of granules: block ranges and partition ranges:

* Block Range Granules: These are the ranges of physical blocks from a table. Block range
granules are the basic unit of most parallel operations. The size of the object table and the
degree of parallelism (DOP) determine the size of the granule at runtime. Block range
granules do not depend on static pre-allocation of tables or indexes. During the computation
of the granules, Oracle takes the DOP into account and tries to assign granules from
different data files to each of the parallel execution servers, avoiding contention whenever
possible. Thus, the tables involved in the query are divided dynamically into granules and a
single parallel execution server reads each granule. PEC manages this process.

* Partition Granules: A query server process works on an entire partition or sub-partition of a
table or index. Partition granules are the basic unit of parallel index range scans and of
parallel operations that modify multiple partitions of a partitioned table or index. These
operations include parallel update, parallel delete, parallel creation of partitioned indexes,
and parallel creation of partitioned tables. This is collectively known as parallel data
manipulation language, or PDML.

++++++++++++++++++++++++++++++++++++++++++++

A data model can be sometimes referred to as a data structure, especially in the context of
programming languages. Data models are often complemented by function models,
especially in the context of enterprise models.

Three perspectives

The ANSI/SPARC three-level architecture shows that a data model can be an external
model (or view), a conceptual model, or a physical model. This is not the only way to look at
data models, but it is a useful way, particularly when comparing models.[4]

A data model instance may be one of three kinds according to ANSI in 1975:[5]

Conceptual schema : describes the semantics of a domain, being the scope of the
model. For example, it may be a model of the interest area of an organization or
industry. This consists of entity classes, representing kinds of things of significance in
the domain, and relationship assertions about associations between pairs of entity
classes. A conceptual schema specifies the kinds of facts or propositions that can be
expressed using the model. In that sense, it defines the allowed expressions in an
artificial 'language' with a scope that is limited by the scope of the model. The use of
conceptual schema has evolved to become a powerful communication tool with
business users. Often called a subject area model (SAM) or high-level data model
(HDM), this model is used to communicate core data concepts, rules, and definitions
to a business user as part of an overall application development or enterprise
initiative. The number of objects should be very small and focused on key concepts.
Try to limit this model to one page, although for extremely large organizations or
complex projects, the model might span two or more pages.[6]
Logical schema : describes the semantics, as represented by a particular data
manipulation technology. This consists of descriptions of tables and columns, object
oriented classes, and XML tags, among other things.
Physical schema : describes the physical means by which data are stored. This is
concerned with partitions, CPUs, tablespaces, and the like.

A database model is a specification describing how a database is structured and used. Several such models have been suggested. Common models include:

Flat model

Hierarchical model

Network model

Relational model

Flat model: This may not strictly qualify as a data model. The flat (or table) model
consists of a single, two-dimensional array of data elements, where all members of a
given column are assumed to be similar values, and all members of a row are
assumed to be related to one another.
Hierarchical model: In this model data is organized into a tree-like structure, implying
a single upward link in each record to describe the nesting, and a sort field to keep
the records in a particular order in each same-level list.
Network model: This model organizes data using two fundamental constructs, called
records and sets. Records contain fields, and sets define one-to-many relationships
between records: one owner, many members.
Relational model: This is a database model based on first-order predicate logic. Its core
idea is to describe a database as a collection of predicates over a finite set of
predicate variables, describing constraints on the possible values and combinations
of values.

Other common models include the concept-oriented model and the star schema.

Data properties

Some important properties of data for which requirements need to be met are:

definition-related properties[4]
o relevance: the usefulness of the data in the context of your business.
o clarity: the availability of a clear and shared definition for the data.
o consistency: the compatibility of the same type of data from different sources.


content-related properties
o timeliness: the availability of data at the time required and how up to date
that data is.
o accuracy: how close to the truth the data is.
properties related to both definition and content
o completeness: how much of the required data is available.
o accessibility: where, how, and to whom the data is available or not available
(e.g. security).
o cost: the cost incurred in obtaining the data, and making it available for use.

Data Accuracy dimension of Data Quality:

Accuracy of data is the degree to which data correctly reflects the real-world object or event being described.

Examples of Data Accuracy

The address of a customer in the customer database is the customer's real address.

The temperature recorded in the thermometer is the real temperature.


The bank balance in the customer's account is the real amount the customer holds with the bank.

Data Completeness dimension of Data Quality

Completeness of data is the extent to which the expected attributes of data are provided.

For example, customer data is considered complete if:

All customer addresses, contact details and other information are available.
Data of all customers is available.

Data completeness is defined against 'expected completeness'. It is possible that some data is not available but the data is still considered complete, because it meets the expectations of the user. Every data requirement has 'mandatory' and 'optional' aspects. For example, a customer's mailing address is mandatory and must be available, whereas the customer's office address is optional, so it is acceptable if it is not available.

Data can be complete, but inaccurate:

All the customers' addresses are available, but many of them are not correct.
The health records of all patients have a 'last visit' date, but some of them contain future dates.

Data Consistency dimension of quality of data

Consistency of data means that data across the enterprise should be in sync with each other.

Examples of data inconsistency are:

An agent is inactive, but his disbursement account is still active.

A credit card is cancelled and inactive, but the card billing status shows 'due'.

Data can be accurate (i.e., it represents what happened in the real world) but still inconsistent:

An Airline promotion campaign closure date is Jan 31, and there is a passenger ticket
booked under the campaign on Feb. 2.

Data is inconsistent when it is in sync within a narrow domain of the organization but not in sync across the organization. For example:

The collection management system shows the cheque status as 'cleared', but in the accounting system the money is not yet shown as credited to the bank account. The reason for this kind of inconsistency is that the system interfaces are synchronized only during the end-of-day batch runs.

Data can be complete, but inconsistent

Data for all the packages dispatched from New York to Chicago are available, but some of the packages are also shown with an 'under bar-coding' status.
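Simple checks like the completeness and consistency examples above can be sketched in SQL; the table and column names here are hypothetical:

```sql
-- Completeness: how many customers are missing the mandatory mailing address?
SELECT COUNT(*) AS missing_mailing_address
FROM   customer
WHERE  mailing_address IS NULL;

-- Consistency: cheques 'cleared' in collections but not yet credited in accounting
SELECT c.cheque_id
FROM   collection_cheque c
LEFT JOIN accounting_entry a
       ON a.cheque_id = c.cheque_id
      AND a.entry_type = 'CREDIT'
WHERE  c.status = 'CLEARED'
  AND  a.cheque_id IS NULL;
```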

Data Timeliness

'Data delayed' is 'Data Denied'

The timeliness of data is extremely important. This is reflected in:

Companies being required to publish their quarterly results within a given time frame.
Customer service providing up-to-date information to customers.
Credit systems checking credit card account activity.

Timeliness depends on user expectation. Online availability of data could be required for a room allocation system in hospitality, whereas overnight data is fine for a billing system.

Example of Data not being timely:

The courier package status is 'delivered', but it is updated in the system only in the nightly batch run, so the online status is not available.
The financial statements of a company are published one month after the year-end.
The census data is available two years after the census is done.

Data Auditability

Auditability means that any transaction, report, accounting entry, bank statement, etc. can be tracked back to its originating transaction. This requires a common identifier, which should stay with a transaction as it undergoes transformation, aggregation and reporting.

Examples of non-auditable data:

A car chassis number cannot be linked to the part number supplied by an ancillary supplier.
A surgery report cannot be linked to the doctor ID of the preliminary diagnosis or to the pathologist ID.
Different Types of Dimensions and Facts in Data Warehouse

Dimension -

A dimension table typically has two types of columns: primary keys referenced by fact tables, and textual/descriptive data.

Fact -
A fact table typically has two types of columns: foreign keys to dimension tables, and measures, i.e., columns that contain numeric facts. A fact table can contain fact data at the detail or aggregated level (see the sketch below).
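As a minimal sketch with hypothetical names, a date dimension and a sales fact table might look like this, the fact table holding only foreign keys and numeric measures:

```sql
CREATE TABLE dim_date (
    date_key      INTEGER PRIMARY KEY,   -- surrogate key
    calendar_date DATE,
    month_name    VARCHAR(20),
    year_number   INTEGER
);

CREATE TABLE fact_sales (
    date_key     INTEGER NOT NULL REFERENCES dim_date (date_key),
    product_key  INTEGER NOT NULL,       -- FK to a product dimension
    store_key    INTEGER NOT NULL,       -- FK to a store dimension
    sales_amount NUMERIC(12,2),          -- additive measure
    units_sold   INTEGER                 -- additive measure
);
```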

Types of Dimensions -

Slowly Changing Dimensions:

Attributes of a dimension may undergo changes over time. Whether the history of changes to a particular attribute should be preserved in the data warehouse depends on the business requirement. Such an attribute is called a Slowly Changing Attribute, and a dimension containing such an attribute is called a Slowly Changing Dimension.

Rapidly Changing Dimensions: A dimension attribute that changes frequently is a Rapidly Changing Attribute. If you don't need to track the changes, the Rapidly Changing Attribute is no problem, but if you do need to track the changes, using a standard Slowly Changing Dimension technique can result in a huge inflation of the size of the dimension. One solution is to move the attribute to its own dimension, with a separate foreign key in the fact table. This new dimension is called a Rapidly Changing Dimension.

Junk Dimensions: A junk dimension is a single table with a combination of different and
unrelated attributes to avoid having a large number of foreign keys in the fact table. Junk
dimensions are often created to manage the foreign keys created by Rapidly Changing
Dimensions.

Inferred Dimensions: While loading fact records, a dimension record may not yet be ready. One solution is to generate a surrogate key with Null for all the other attributes. This should technically be called an inferred member, but it is often called an inferred dimension.

Conformed Dimensions: A dimension that is used in multiple locations is called a conformed dimension. A conformed dimension may be used with multiple fact tables in a single database, or across multiple data marts or data warehouses.

Degenerate Dimensions: A degenerate dimension is when the dimension attribute is stored as part of the fact table, and not in a separate dimension table. These are essentially dimension keys for which there are no other attributes. In a data warehouse, these are often used as the result of a drill-through query to analyze the source of an aggregated number in a report. You can use these values to trace back to transactions in the OLTP system.

Role Playing Dimensions: A role-playing dimension is one where the same dimension key
along with its associated attributes can be joined to more than one foreign key in the
fact table. For example, a fact table may include foreign keys for both Ship Date and
Delivery Date. But the same date dimension attributes apply to each foreign key, so you can
join the same dimension table to both foreign keys. Here the date dimension takes multiple roles to map the ship date as well as the delivery date, hence the name Role-Playing Dimension (see the query sketch below).
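A hedged sketch of the ship date / delivery date example, joining the same date dimension twice through table aliases (the table and column names are hypothetical):

```sql
SELECT s.order_id,
       ship_dt.calendar_date     AS ship_date,
       delivery_dt.calendar_date AS delivery_date
FROM   fact_shipments s
JOIN   dim_date ship_dt     ON ship_dt.date_key     = s.ship_date_key
JOIN   dim_date delivery_dt ON delivery_dt.date_key = s.delivery_date_key;
```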

Shrunken Dimensions: A shrunken dimension is a subset of another dimension. For example, the Orders fact table may include a foreign key for Product, but the Target fact table may include a foreign key only for ProductCategory, which is in the Product table, but much less granular. Creating a smaller dimension table, with ProductCategory as its primary key, is one way of dealing with this situation of heterogeneous grain. If the Product dimension is snowflaked, there is probably already a separate table for ProductCategory, which can serve as the shrunken dimension.

Static Dimensions: Static dimensions are not extracted from the original data source, but
are created within the context of the data warehouse. A static dimension can be loaded
manually for example with Status codes or it can be generated by a procedure, such as
a Date or Time dimension.

Types of Facts -

Additive: Additive facts are facts that can be summed up through all of the dimensions in the fact table. A sales fact is a good example of an additive fact.

Semi-Additive: Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.
Eg: A daily balances fact can be summed up across the customer dimension but not across the time dimension.

Non-Additive: Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.
Eg: Facts that hold percentages or calculated ratios.

Factless Fact Table:

In the real world, it is possible to have a fact table that contains no measures or facts. These tables are called Factless Fact tables.
Eg: A fact table which has only a product key and a date key is a factless fact. There are no measures in this table, but you can still get the number of products sold over a period of time.
Based on the above classifications, fact tables are categorized into two:
Cumulative: This type of fact table describes what has happened over a period of time. For
example, this fact table may describe the total sales by product by store by day. The facts
for this type of fact tables are mostly additive facts. The first example presented here is a
cumulative fact table.

Snapshot:

This type of fact table describes the state of things in a particular instance of time, and
usually includes more semi-additive and non-additive facts. The second example presented
here is a snapshot fact table.
How does Push-Down Optimization work?
One can push transformation logic to the source or target database using pushdown
optimization. The Integration Service translates the transformation logic into SQL queries
and sends the SQL queries to the source or the target database, which executes the SQL
queries to process the transformations. The amount of transformation logic one can push to
the database depends on the database, transformation logic, and mapping and session
configuration. The Integration Service analyzes the transformation logic it can push to the
database and executes the SQL statement generated against the source or target tables,
and it processes any transformation logic that it cannot push to the database.
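As a rough illustration (not the exact SQL the Integration Service would generate), filter and expression logic pushed to the target database might end up as a single statement like this; the table and column names are hypothetical:

```sql
-- Filter + expression logic executed inside the database instead of the Integration Service
INSERT INTO tgt_customer (customer_id, full_name, country)
SELECT customer_id,
       first_name || ' ' || last_name AS full_name,
       UPPER(country)                 AS country
FROM   src_customer
WHERE  status = 'ACTIVE';
```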

1. Why is data warehousing query centric?


Data warehousing is primarily used to organize data so that queries for analysis and decision support can be answered quickly.
All operations performed on the data use a query, such as update, insert, and select operations. Thus it is query centric.
2. Difference between OLTP and DW/BI Database.

OLTP vs DW

OLTP: transaction oriented; DW: subject oriented.
OLTP: current data; DW: historical data.
OLTP: normalized; DW: denormalized.
OLTP: based on ER modeling; DW: based on dimensional modeling.
OLTP: detail data; DW: summarized data.
OLTP: more users; DW: fewer users.
OLTP: performance factor measured by transaction output; DW: performance factor measured by query output.
OLTP: read/update; DW: read/batch update.
OLTP: frequent access; DW: ad hoc access.
OLTP: runs the business; DW: optimizes the business.

3. Difference between Star and Snowflake schema

Snowflake Schema vs Star Schema

Joins: Snowflake has a higher number of joins; Star has fewer joins.
Ease of use: Snowflake queries are more complex and hence less easy to understand; Star queries are less complex and easy to understand.
Query performance: Snowflake has more foreign keys and hence more query execution time; Star has fewer foreign keys and hence less query execution time.
Ease of maintenance/change: Snowflake has no redundancy and hence is easier to maintain and change; Star has redundant data and hence is less easy to maintain/change.
Type of data warehouse: Snowflake is good for large data warehouses; Star is good for small data warehouses/data marts.
Dimension table normalization: Snowflake dimension tables are in 3rd Normal Form; Star dimension tables are denormalized (2nd Normal Form).
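A minimal sketch of the normalization difference above, with hypothetical names: in a star schema the product dimension is a single denormalized table, while in a snowflake schema its category attributes are normalized out into a related table:

```sql
-- Star: one denormalized dimension table
CREATE TABLE dim_product_star (
    product_key   INTEGER PRIMARY KEY,
    product_name  VARCHAR(100),
    category_name VARCHAR(50),    -- repeated for every product in the category
    category_desc VARCHAR(200)
);

-- Snowflake: category attributes moved to their own table
CREATE TABLE dim_category (
    category_key  INTEGER PRIMARY KEY,
    category_name VARCHAR(50),
    category_desc VARCHAR(200)
);

CREATE TABLE dim_product_snow (
    product_key  INTEGER PRIMARY KEY,
    product_name VARCHAR(100),
    category_key INTEGER REFERENCES dim_category (category_key)
);
```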

4. What are Surrogate Keys


Surrogate Key
1. A surrogate key is a substitution for the natural primary key.
2. It is just a unique identifier or number for each row that can be used as the primary key of the table.
The only requirement for a surrogate primary key is that it is unique for each row in the table.
3. Data warehouses typically use a surrogate key (also known as an artificial or identity key) for the
dimension tables' primary keys.
4. It is useful because the natural primary key (e.g., Customer Number in the Customer table) can change,
and this makes updates more difficult.
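A minimal sketch of a surrogate key on a dimension table (hypothetical names; identity/sequence syntax varies by database):

```sql
CREATE TABLE dim_customer (
    customer_key    INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- surrogate key
    customer_number VARCHAR(20),    -- natural key from the source system
    customer_name   VARCHAR(100),
    city            VARCHAR(50)
);

-- The natural key is kept as an ordinary attribute, so it can change in the source
-- without breaking fact-table joins, which use customer_key instead.
```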

5. What are Slowly Changing Dimensions


SCD
SCD: data in the dimension changes very rarely.
There are mainly 3 SCD types:
1) SCD-1: the previous data is overwritten by the current data.
2) SCD-2: an additional record is added for each change (history is maintained).
In SCD-2 there are 3 variants: 1) versioning, 2) flag value, 3) date range.
Versioning: provide version numbers for the old and new records.
Flag value: attach flags to the old and new records (0 = old, 1 = new).
Date range: store create and update timestamps (effective date ranges) on each record.
3) SCD-3: only the previous and current data are maintained.
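A hedged SCD-2 sketch in the date-range style, assuming the dimension also carries start_date, end_date, and current_flag columns (all names hypothetical): close out the old row and insert a new one when the tracked attribute changes.

```sql
-- 1) Expire the current row for customers whose city has changed in staging
UPDATE dim_customer d
SET    end_date     = CURRENT_DATE,
       current_flag = 0
WHERE  d.current_flag = 1
  AND  EXISTS (SELECT 1
               FROM   stg_customer s
               WHERE  s.customer_number = d.customer_number
               AND    s.city <> d.city);

-- 2) Insert a new current row for customers that no longer have one
INSERT INTO dim_customer (customer_number, customer_name, city,
                          start_date, end_date, current_flag)
SELECT s.customer_number, s.customer_name, s.city,
       CURRENT_DATE, DATE '9999-12-31', 1
FROM   stg_customer s
WHERE  NOT EXISTS (SELECT 1
                   FROM   dim_customer d
                   WHERE  d.customer_number = s.customer_number
                   AND    d.current_flag = 1);
```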

6. What is granularity in fact tables?


Granularity: the first step in designing a fact table is to determine the granularity of the fact table. By granularity, we mean the lowest level of information that will be stored in the fact table. This constitutes two steps:

Determine which dimensions will be included. Determine where along the hierarchy of each dimension the information will be kept.

The determining factor usually goes back to the requirements.

8. Difference between Inmon and Kimball approaches

Kimball vs Inmon

Bill Inmon's approach:


---------------------------

The approach is a "top-down" approach, i.e., it emphasizes the data warehouse.

Starts with designing an enterprise model for the data warehouse.

Multi-tier architecture, comprising a staging area, a data warehouse, and dependent data marts.

Persistent staging area.

The warehouse is enterprise oriented; the marts are function specific.

The warehouse has atomic-level data and the marts have summary data.

The warehouse uses a normalized enterprise model; the marts use subject-specific dimensional models.

Both the warehouse and the marts can be queried.

Ralph Kimball's approach:


-------------------------------
The approach is a "bottom-up" approach, i.e., it emphasizes the data marts.

Starts with designing a dimensional model for a data mart.


Flat architecture, consisting of a staging area and data marts.

Non-persistent staging area.

Marts contain both atomic and summary data.

Marts are designed for the enterprise using the dimensional model.

Each mart consists of a single star schema.

Marts use conformed dimensions.

8. Data warehouse architecture


Components
1. Source system
2. Data staging area: the portion of the data warehouse restricted to extracting, cleaning, matching and loading data from multiple legacy systems
3. Data warehouse database
4. Data marts: smaller DWs concentrated on a single subject area
5. ETL
6. BI
7. Metadata and metadata repository
10. Data Profiling: (validating the data)
It is a method for looking at and examining the quality, content, scope and structure of data in order to develop an ETL system. Data profiling helps us understand the data before business performance trend reports can be created for the data warehouse user.

Resolve missing values/erroneous values


Discover formats and patterns in your data

11. Difference between technical, operational and business metadata.

1. Business Meta Data


1. Business terms and definitions for tables and columns
2. Subject area names
3. Query and report definitions
4. Report mappings
5. Data Stewards
Business metadata (data warehouse metadata, front room metadata, operational metadata) - this type of metadata
stores business definitions of the data, it contains high-level definitions of all fields present in the data
warehouse, information about cubes, aggregates, datamarts.
Business metadata is mainly addressed to and used by the data warehouse users, report authors (for ad-hoc
querying), cubes creators, data managers, testers, analysts.

Typically, the following information needs to be provided to describe business metadata:


DW Table Name
DW Column Name
Business Name - short and descriptive header information
Definition - extended description with brief overview of the business rules for the field

2. Technical Metadata
1. Physical table and column names
2. Data mapping and transformation logic
3. Source systems
4. Foreign keys and indexes
5. Security
6. ETL process names
Technical metadata (ETL process metadata, back room metadata, transformation metadata) is a representation of
the ETL process. It stores data mapping and transformations from source systems to the data
warehouse and is mostly used by datawarehouse developers, specialists and ETL modelers.

Target Database - Data Warehouse instance


Source Tables - one or more tables which are input to calculate a value of the field
Source Columns - one or more columns which are input to calculate a value of the field
Target Table - target DW table and column are always single in a metadata repository.
Target Column - target DW column

Operational Meta data

Operational meta data, unlike information stored in the meta data repository, is referenced at a row level of
granularity in the data warehouse.
Operational meta data provides a detailed row level explanation of actual information content in
the data warehouse.
1. Load Cycle Identifier
2. Current Flag Indicator
3. Load Date
4. Update Date
5. Operational System(s) Identifier
6. Active in Operational System Flag
7. Confidence Level Indicator.

12. Difference between Dimensional modelling vs Data modelling vs ER modeling

ER modeling is a representation oriented towards entities and their relationships.


Dimensional modelling is sometimes mistaken for ER modelling, but it is not the same; dimensional modelling is more flexible from the user's perspective.
Unlike dimensional modelling, ER modelling is not mapped for creating analytical schemas, and it does not convert normalized data into a denormalized form.

ER modeling vs. dimensional modeling:

ER modeling is opted for OLTP; dimensional modeling is opted for the data warehouse.
ER modeling allows manipulations (DML commands); dimensional modeling is only for retrieving data.
ER modeling uses a normalized format; dimensional modeling uses a denormalized format.
ER modeling performance is lower; dimensional modeling performance is higher.

What is Difference between E-R Modeling and Dimensional Modeling?

The basic difference is that E-R modeling has both a logical and a physical model, whereas the dimensional model has only a physical model. E-R modeling is used for normalizing the OLTP database design.

Dimensional modeling is used for de-normalizing the ROLAP/MOLAP design. Adding to the point:
E-R modeling revolves around the Entities and their relationships to capture the overall process of the system.

Dimensional model / Multidimensional Modeling revolves around Dimensions (point of analysis) for decision-making
and not to capture the process.

In ER modeling the data is in normalized form, so there are more joins, which may adversely affect system performance. In dimensional modeling the data is denormalized, so there are fewer joins and system performance improves.

13. What are conformed, junk, degenerate, non-conformed dimensions?

Dimension tables contain attributes that describe fact records in the fact table. Some of these attributes provide
descriptive information; others are used to specify how fact table data should be summarized to provide useful
information to the analyst. Dimension tables contain hierarchies of attributes that aid in summarization. For
example, a dimension containing product information would often contain a hierarchy that separates products into
categories such as food, drink, and nonconsumable items, with each of these categories further subdivided a
number of times until the individual product SKU is reached at the lowest level.

Junk dimension: columns that are rarely used or not used at all can be combined into a dimension called a junk dimension. A "junk" dimension is a collection of random transactional codes, flags and/or text attributes that are unrelated to any particular dimension. The junk dimension is simply a structure that provides a convenient place to store the junk attributes.

Degenerate dimension: a degenerate dimension is data that is dimensional in nature but stored in the fact table rather than in a separate dimension table.

Ex.: the EMP table has empno, ename, sal, job, deptno.

If we take only the columns empno and ename from the EMP table and treat them as a dimension kept with the fact, this is called a degenerate dimension.

Conformed dimension: conformed dimensions are dimensions that can be used across multiple data marts in combination with multiple fact tables.

They are dimension tables in a star schema data mart that adhere to a common structure, and therefore allow queries to be executed across star schemas. For example, the Calendar dimension is commonly needed in most data marts. By making this Calendar dimension adhere to a single structure, regardless of which data mart it is used in within your organization, you can query by date/time from one data mart to another. Dimensions are conformed if they share one or more attributes whose values are drawn from the same domains. In short, if a table is used as a dimension table for more than one fact table, it is called a conformed dimension (see the query sketch below).
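A rough sketch of why conformance matters, with hypothetical names: because two fact tables share the same dim_date, results from both star schemas can be aggregated separately and then lined up on the shared calendar attributes:

```sql
-- Drill-across: aggregate each fact on its own, then join on the conformed dimension
SELECT s.year_number, s.total_sales, r.total_returns
FROM  (SELECT d.year_number, SUM(f.sales_amount) AS total_sales
       FROM   fact_sales f JOIN dim_date d ON d.date_key = f.date_key
       GROUP BY d.year_number) s
JOIN  (SELECT d.year_number, SUM(f.return_amount) AS total_returns
       FROM   fact_returns f JOIN dim_date d ON d.date_key = f.date_key
       GROUP BY d.year_number) r
  ON   r.year_number = s.year_number;
```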

Role-playing dimension

Dimensions are often recycled for multiple applications within the same database. For instance, a "Date" dimension can be used for "Date of Sale", as well as "Date of Delivery" or "Date of Hire". This is often referred to as a "role-playing dimension".

Uniqueness

Data uniqueness is a data quality dimension used to describe the positive outcome of solving and avoiding
unwanted data duplication.
Definition of validity

This refers to the extent to which a measurement does what it is supposed to do. Data need not only to be reliable but also true and accurate. If a measurement is valid, it is also reliable; but if it is reliable, it may or may not be valid. The validity of an instrument is easy to determine if one is dealing with information that can be quantified. For example, in growth monitoring, the height and weight of a child are easy to determine. However, when handling qualitative information such as feelings, likes, dislikes, opinions, etc., validity is more difficult to determine.

Definition of reliability

Reliability refers to the consistency, stability, or dependability of the data. Whenever an investigator measures a variable, he or she wants to be sure that the measurement provides dependable and consistent results. A reliable measurement is one that, if repeated a second time, will give the same results as it did the first time. If the results are different, then the measurement is unreliable. It is easier to determine reliability when dealing with information that can be quantified. For example, it is easier to determine the reliability of an instrument used to measure performance in mathematics for a form 2 class than of an instrument used to assess their musical ability. The reliability of an instrument is increased by identifying the precise data needed and by repeated use of the instrument in field testing.

In surveys, reliability problems commonly result when the respondents do not understand the question, are asked
about something they do not clearly recall, or are asked about something of little relevance to them. Data obtained
for instance from educational statistics or records can be unreliable if educational administrators fail to record
information or make frequent errors in entering the data.

Non-conformed dimensions: dimensions that are not shared or consistently defined across data marts, so facts from different marts cannot be compared along them.

What are facts and fact table?

Facts are the verb. An entry in a fact table marks a discrete event that happens to something from the dimension
table. A product sale would be recorded in a fact table. The event of the sale would be noted by what product was
sold, which employee sold it, and which customer bought it. Product, Employee, and Customer are all dimensions
that describe the event, the sale.

In addition fact tables also typically have some kind of quantitative data. The quantity sold, the price per item, total
price, and so on.

A fact table might contain business sales events such as cash register transactions or the contributions and
expenditures of a nonprofit organization. Fact tables usually contain large numbers of rows, sometimes in the
hundreds of millions of records when they contain one or more years of history for a large organization.

A key characteristic of a fact table is that it contains numerical data (facts) that can be summarized to provide
information about the history of the operation of the organization. Each fact table also includes a multipart index
that contains as foreign keys the primary keys of related dimension tables, which contain the attributes of the fact
records. Fact tables should not contain descriptive information or any data other than the numerical measurement
fields and the index fields that relate the facts to corresponding entries in the dimension tables.

Fact table typically has two types of columns: those that contain numeric facts (often called measurements), and
those that are foreign keys to dimension tables.

A fact table contains either detail-level facts or facts that have been aggregated. Fact tables that contain
aggregated facts are often called summary tables. A fact table usually contains facts with the same level of
aggregation.

14. What are additive facts, semi additive facts and non-additive facts?
Additive: Additive facts are facts that can be summed up through all of the dimensions in the fact table. A common example of this is sales.

Semi-Additive: Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table,
but not the others. An example of this is inventory levels, where you cannot tell what a level means simply by
looking at it.

Non-Additive: Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact
table. An example of this is averages.

Examples to illustrate each of the three types of facts. The first example assumes that we are a retailer, and we
have a fact table with the following columns:

Date
Store
Product
Sales_Amount

The purpose of this table is to record the sales amount for each product in each store on a daily basis.
Sales_Amount is the fact. In this case, Sales_Amount is an additive fact, because you can sum up this fact along any
of the three dimensions present in the fact table -- date, store, and product. For example, the sum of Sales_Amount
for all 7 days in a week represents the total sales amount for that week.

Say we are a bank with the following fact table:


Date
Account
Current_Balance
Profit_Margin

The purpose of this table is to record the current balance for each account at the end of each day, as well as the
profit margin for each account for each day. Current_Balance and Profit_Margin are the facts. Current_Balance is a
semi-additive fact, as it makes sense to add them up for all accounts (what's the total current balance for all
accounts in the bank?), but it does not make sense to add them up through time (adding up all current balances for
a given account for each day of the month does not give us any useful information). Profit_Margin is a non-additive
fact, for it does not make sense to add them up for the account level or the day level.
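Following the two fact tables above, a hedged sketch of the difference in SQL (column and table names adapted and hypothetical):

```sql
-- Additive: Sales_Amount can be summed across date, store, and product
SELECT store, SUM(sales_amount) AS weekly_sales
FROM   fact_sales
WHERE  sale_date BETWEEN DATE '2012-06-04' AND DATE '2012-06-10'
GROUP BY store;

-- Semi-additive: Current_Balance can be summed across accounts for a single day...
SELECT SUM(current_balance) AS total_balance
FROM   fact_account_balance
WHERE  balance_date = DATE '2012-06-10';

-- ...but across time it is typically averaged (or the last value taken), not summed
SELECT account, AVG(current_balance) AS avg_balance_for_june
FROM   fact_account_balance
WHERE  balance_date BETWEEN DATE '2012-06-01' AND DATE '2012-06-30'
GROUP BY account;
```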

Note: Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table. However, they are not considered useless; viewed against different dimensions, the same facts can still be useful.

15. Explain the different types of hierarchies (Balanced, unbalanced, ragged)

A hierarchy defines relationships among a set of attributes that are grouped by levels in the dimension of a cube
model. These relationships between levels are usually defined with a functional dependency to improve
optimization. Multiple hierarchies can be defined for a dimension of a cube model.

Hierarchies document the relationships between levels in a dimension. When you define a hierarchy, DB2 Data
Warehouse Edition creates a functional dependency by default between consecutive levels, such that the level key
attributes from one level are functionally dependent on the level key attributes in the lower level. For example, in a
Region hierarchy that is defined with the following levels: Region, State, and City, DB2 Data Warehouse Edition
creates two functional dependencies. One functional dependency to show that City determines State, and a second
functional dependency to show that State determines Region

Balanced

A hierarchy with meaningful levels and branches that have a consistent depth. Each level's
logical parent is in the level directly above it. A balanced hierarchy can represent time where the meaning
and depth of each level, such as Year, Quarter, and Month, is consistent. They are consistent because each
level represents the same type of information, and each level is logically equivalent. Figure 1 shows an
example of a balanced time hierarchy.
Figure 1. Example of a balanced hierarchy.
Unbalanced
A hierarchy with levels that have a consistent parent-child relationship but logically
inconsistent levels. The hierarchy branches also can have inconsistent depths. An unbalanced hierarchy
can represent an organization chart. For example, Figure 2 shows a chief executive officer (CEO) on the top
level of the hierarchy and at least two of the people that might branch off below including the chief
operating officer and the executive secretary. The chief operating officer has more people branching off
also, but the executive secretary does not. The parent-child relationships on both branches of the hierarchy
are consistent. However, the levels of both branches are not logical equivalents. For example, an executive
secretary is not the logical equivalent of a chief operating officer.
Figure 2. Example of an unbalanced hierarchy.

Ragged
A hierarchy in which each level has a consistent meaning, but the branches have inconsistent
depths because at least one member attribute in a branch level is unpopulated. A ragged
hierarchy can represent a geographic hierarchy in which the meaning of each level such as city or country
is used consistently, but the depth of the hierarchy varies. Figure 3 shows a geographic hierarchy that has
Continent, Country, Province/State, and City levels defined. One branch has North America as the
Continent, United States as the Country, California as the Province or State, and San Francisco as the City.
However, the hierarchy becomes ragged when one member does not have an entry at all of the levels. For
example, another branch has Europe as the Continent, Greece as the Country, and Athens as the City, but
has no entry for the Province or State level because this level is not applicable to Greece for the business
model in this example. In this example, the Greece and United States branches descend to different
depths, creating a ragged hierarchy.
Figure 3. Example of a ragged hierarchy.
OTHER IMPORTANT QUESTIONS:

1. Difference between Normalization and Denormalization?

Normalization is the process of removing redundancies.


OLTP uses the Normalization process

Denormalization is the process of allowing redundancies.


OLAP/DW uses denormalization to capture a greater level of detailed data (each and every transaction).

2. Why fact table is in normal form?

A fact table consists of measurements of business requirements and foreign keys to dimension tables, as per business rules.

There may be only surrogate keys within a star schema, which itself is de-normalized; it would be different if there were also foreign keys on the dimensions. Being in normal form, more granularity is achieved with less coding, i.e., fewer joins while retrieving the fact.

3. What is conformed fact?

Conformed facts are allowed to have the same name in separate tables and can be combined and compared mathematically. Conformed dimensions are those tables that have a fixed structure. There will be no need to change the metadata of these tables, and they can go along with any number of facts in that application without any changes.

4. What are indexes?

Indexes play an important role in data warehouse performance, as they do in any relational database. Every
dimension table must be indexed on its primary key. Indexes on other columns such as those that identify levels in
the hierarchical structure can also be useful in the performance of some specialized queries.

The fact table must be indexed on the composite primary key made up of the foreign keys of the dimension tables.
These are the primary indexes needed for most data warehouse applications because of the simplicity of star and
snowflake schemas. Special query and reporting requirements may indicate the need for additional indexes.
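A minimal sketch of this typical indexing pattern (hypothetical names; bitmap indexes are an Oracle feature often used on low-cardinality fact foreign keys):

```sql
-- Dimension table indexed on its primary (surrogate) key
ALTER TABLE dim_product ADD CONSTRAINT pk_dim_product PRIMARY KEY (product_key);

-- Fact table indexed on the composite of its dimension foreign keys
CREATE INDEX ix_fact_sales_keys ON fact_sales (date_key, product_key, store_key);

-- Optional: individual bitmap indexes on fact foreign keys (Oracle)
CREATE BITMAP INDEX bx_fact_sales_date ON fact_sales (date_key);
```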

5. What are factless facts?

There are two kinds of fact tables which don't have any facts at all. They are called FACTLESS FACT TABLES. As
per Kimball's law, every many-to-many relationship is a fact table. They may consist of nothing but keys.
The first type of factless fact table records an event, e.g., attendance of a student. Many event-tracking tables in the
dimensional DWH turn out to be factless tables.
The second type of factless fact table is called a coverage table. Coverage tables are frequently needed in
a dimensional DWH when the primary fact table is sparse.

A factless fact table captures the many-to-many relationships between dimensions, but contains no numeric or textual facts. They are often used to record events or
coverage information. Common examples of factless fact tables include:
- Identifying product promotion events (to determine promoted products that didn't sell)
- Tracking student attendance or registration events
- Tracking insurance-related accident events
- Identifying building, facility, and equipment schedules for a hospital or university
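A minimal sketch of the student attendance example (hypothetical names): the fact table holds only keys, yet event counts can still be derived from it:

```sql
CREATE TABLE fact_attendance (
    date_key    INTEGER NOT NULL,   -- FK to the date dimension (e.g., YYYYMMDD)
    student_key INTEGER NOT NULL,   -- FK to the student dimension
    class_key   INTEGER NOT NULL    -- FK to the class dimension
);                                  -- no measure columns at all

-- "How many attendance events per class in June 2012?" is just a row count
SELECT class_key, COUNT(*) AS attendance_events
FROM   fact_attendance
WHERE  date_key BETWEEN 20120601 AND 20120630
GROUP BY class_key;
```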

6. Key Elements of Multi-Dimensional Metadata

Table 14-1 describes key elements of multi-dimensional metadata:

Table 14-1. Key Elements of Multi-Dimensional Metadata

Term Definition
Aggregate Pre-stored summary of data or grouping of detailed data which satisfies a specific
business rule. Example rules: sum, min, count, or combinations of them.
Level A specific property of a dimension. Examples: size, type, and color.
Cube A set of related factual measures, aggregates, and dimensions for a specific dimensional
analysis problem. Example: regional product sales.
Dimension A set of level properties that describe a specific aspect of a business, used for analyzing
the factual measures of one or more cubes which use that dimension. Examples:
geography, time, customer, and product.
Drilling Drilling is the term used for navigating through a cube. This navigation is usually
performed to access a summary level of information or to provide more detailed
properties of a dimension in a hierarchy.
Fact A fact is a time variant measurement of quantitative data in a cube; for example, units
sold, sales dollars, or total profit.
Hierarchy Hierarchy concept refers to the level of granularity represented by the data in a particular
dimension of a cube. For example, state, county, district, and city represent different
granularity in the hierarchy of the geography dimension.
Measure Means for representing quantitative data in facts or aggregates. Example measures are
total sales or units sold per year.
Normalization A process used for reducing redundancies and removing anomalies in related dimension
tables in various hierarchies.
Redundancy Term used for referring to duplication of data among related tables for the sake of
improving the speed of query processing.
Star Schema A multi-dimensional model in which each disjoint dimension is represented by a single
(denormalized) table.
Snowflake Schema A normalized multi-dimensional model in which at least one dimension is represented by
two or more hierarchically related tables.
