
Data Warehousing

Data warehousing is the process of collecting the huge amounts of electronic data that organizations have stored in recent years and using that data to accomplish goals that go beyond the routine tasks linked to daily processing.

Data Warehouse
A data warehouse is a collection of data that supports decision-making processes. It is an integrated collection of databases rather than a single database, and it must be considered the single source of information for all decision support processing and all informational applications throughout the organization. It provides the following features:
It is subject-oriented.
It is integrated and consistent.
It shows its evolution over time and it is not volatile.
It is used to support management decision-making processes and business intelligence, and to surface the information and knowledge needed to effectively manage the organization.
It motivates investigation of key challenges and research directions for this discipline.
It comprises data that belong to different information subject areas.
It contains different categories of data.
The data warehouse carries out the process of accessing heterogeneous data, cleaning and transforming it, and storing it in a structure that is easy to access, understand and use. This data is finally used for report generation, querying and data analysis.
Warehouse catalog
The warehouse catalog is the subsystem that stores and manages all the metadata. The metadata includes information such as the mapping of data elements from source to target, the business meaning of data elements for information systems and business users, the data models (both logical and physical), a description of how the data are used, and temporal information.
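As a rough illustration, a single catalog entry might be represented as a small record keyed by the target data element. This is only a sketch; the table, column and field names below are hypothetical and not taken from any particular product.

# A sketch of one warehouse-catalog (metadata) entry. Names are illustrative.
catalog_entry = {
    "target": "sales_fact.revenue_amount",          # data element in the warehouse
    "source": "orders_db.order_lines.line_total",   # mapping from source to target
    "business_meaning": "Net revenue per order line, in EUR",
    "transformation": "SUM(line_total) grouped by day, product and store",
    "data_model": {"logical": "SalesFact", "physical": "SALES_FACT"},
    "usage": "Daily sales reporting, quarterly trend analysis",
    "temporal": {"loaded_since": "2018-01-01", "refresh_frequency": "daily"},
}

# The catalog itself can then be a searchable collection of such entries.
warehouse_catalog = {catalog_entry["target"]: catalog_entry}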

Architectural Properties of Data Warehousing System


The following are the architectural properties of a data warehouse system:
Separation - Analytical and transactional processing should be kept apart as much as possible.
Scalability - Hardware and software architectures should be easy to upgrade as the data volume to be managed and processed, and the number of user requirements to be met, progressively increase.
Extensibility - The architecture should be able to host new applications and technologies without redesigning the whole system.
Security - Monitoring accesses is essential because of the strategic data stored in data warehouses.
Administerability - Data warehouse management should not be overly difficult.

Data Warehouse Architectures


Single Layer
Single-layer architecture is not frequently used in practice. Its goal is to minimize the amount of data stored; to reach this goal, it removes data redundancies. In this case the data warehouse is virtual: it is implemented as a multidimensional view of operational data rather than as physically stored data.

Two Layer
The requirement for separation plays a fundamental role in defining the typical architecture for a data warehouse system. Although it is usually called a two-layer architecture to highlight the separation between the physically available sources and the data warehouse, it actually consists of the following four layers:
1) Source layer
It uses heterogeneous sources of data. Data is originally stored in:
Corporate relational databases
Legacy databases
Information systems outside the corporate walls
2) Data staging
The data stored in the sources should be extracted, cleansed to remove inconsistencies and fill gaps, and integrated to merge heterogeneous sources into one common schema. The so-called Extraction, Transformation, and Loading (ETL) tools can merge heterogeneous schemata and extract, transform, cleanse, validate, filter, and load source data into a data warehouse. ETL takes place once when a data warehouse is populated for the first time, and then occurs every time the data warehouse is regularly updated. ETL consists of four separate phases: extraction (or capture), cleansing (or cleaning or scrubbing), transformation, and loading.

Extraction
Relevant data is obtained from the sources in the extraction phase. Static extraction is used when a data warehouse needs populating for the first time; incremental extraction is used to update the data warehouse regularly, capturing only the changes applied to the sources since the previous extraction. The data to be extracted is mainly selected on the basis of its quality. A sketch of both variants is given below.
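A minimal sketch of the two extraction variants, assuming Python with the standard sqlite3 module and a hypothetical source table orders(id, amount, updated_at):

import sqlite3

def static_extract(conn):
    # Used when the warehouse is populated for the first time: take everything.
    return conn.execute("SELECT id, amount, updated_at FROM orders").fetchall()

def incremental_extract(conn, last_load_ts):
    # Used for regular updates: take only rows changed since the last load.
    return conn.execute(
        "SELECT id, amount, updated_at FROM orders WHERE updated_at > ?",
        (last_load_ts,),
    ).fetchall()

In practice the "last load" timestamp (or a change-data-capture log) would itself be recorded in the warehouse catalog.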

Cleansing
The cleansing phase is crucial in a data warehouse system because it is supposed to improve data quality. A few of the mistakes and inconsistencies that make data dirty are:
Duplicate data
Inconsistent values that are logically associated
Missing data, such as a customer's job
Impossible or wrong values
Transformation
The transformation phase converts data from the format of the operational sources into the format of the data warehouse, for example by standardizing units and codes, matching fields from different sources, and computing derived values. Both cleansing and a simple transformation are sketched below.
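A minimal sketch, in plain Python, of cleansing (plus one small transformation) applied to hypothetical customer records; all field names are illustrative:

# Cleansing and a simple transformation on hypothetical customer records.
def cleanse_and_transform(customers):
    seen, clean, rejected = set(), [], []
    for c in customers:
        if c["customer_id"] in seen:             # duplicate data
            continue
        seen.add(c["customer_id"])
        if not c.get("job"):                     # missing data (e.g. a customer's job)
            c["job"] = "UNKNOWN"
        if c.get("age") is not None and not (0 < c["age"] < 120):
            rejected.append(c)                   # impossible or wrong values
            continue
        # transformation: standardize to the format expected by the warehouse
        c["country"] = c.get("country", "").strip().upper()   # e.g. "de " -> "DE"
        clean.append(c)
    return clean, rejected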

Loading
Loading data into the data warehouse is the last step. Loading can be carried out in two ways:
Refresh - Data warehouse data is completely rewritten. This means that older data is replaced.
Update - Only those changes applied to the source data are added to the data warehouse. Update is typically carried out without deleting or modifying preexisting data. This technique is used in combination with incremental extraction to update data warehouses regularly.
Both strategies are sketched in the example below.
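A minimal sketch of both strategies, assuming Python with sqlite3 and a hypothetical sales_fact table that has a UNIQUE constraint on (day_key, product_key); this is an illustration, not a production loader:

import sqlite3

def refresh(conn, rows):
    # Refresh: the warehouse data is completely rewritten.
    conn.execute("DELETE FROM sales_fact")
    conn.executemany(
        "INSERT INTO sales_fact(day_key, product_key, amount) VALUES (?,?,?)", rows)
    conn.commit()

def update(conn, changed_rows):
    # Update: only changes captured by incremental extraction are applied,
    # without deleting pre-existing data (an upsert per fact row;
    # assumes UNIQUE(day_key, product_key) on the table).
    conn.executemany(
        "INSERT INTO sales_fact(day_key, product_key, amount) VALUES (?,?,?) "
        "ON CONFLICT(day_key, product_key) DO UPDATE SET amount = excluded.amount",
        changed_rows,
    )
    conn.commit()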
3) Data warehouse layer
Information is stored in one logically centralized single repository: the data warehouse. The data warehouse can be accessed directly, but it can also be used as a source for creating data marts, which partially replicate data warehouse contents. Meta-data repositories store information on sources, access procedures, data staging, users, data mart schemata, and so on.
4) Analysis layer
In this layer, integrated data is efficiently and flexibly accessed to issue reports, dynamically analyze information, and simulate hypothetical business scenarios.

Difference between Data Marts and Data Warehouse


The architectural difference between data warehouses and data marts needs to be studied more closely. The component marked as a data warehouse is often also called the primary data warehouse or corporate data warehouse. It acts as a centralized storage system for all the data being summed up. Data marts can be viewed as small, local data warehouses replicating (and summing up as much as possible) the part of the primary data warehouse required for a specific application domain.
Three Layer
In this architecture, the third layer is the reconciled data layer or operational data store. This layer materializes the operational data obtained after integrating and cleansing source data. As a result, those data are integrated, consistent, correct, current, and detailed. The main advantage of the reconciled data layer is that it creates a common reference data model for the whole enterprise. At the same time, it sharply separates the problems of source data extraction and integration from those of data warehouse population. Remarkably, in some cases the reconciled layer is also used directly to better accomplish some operational tasks, such as producing daily reports that cannot be satisfactorily prepared using the corporate applications, or generating data flows to feed external processes periodically so as to benefit from cleaning and integration. However, reconciled data leads to more redundancy of operational source data.
A further variant can be described as a hybrid solution between the single-layer architecture and the two/three-layer architecture. This approach assumes that although a data warehouse is available, it is unable to answer all the queries formulated. This means that users may be interested in directly accessing source data from aggregate data (drill-through).

A perspective on decision support applications


Successfully supporting managerial decision making has become critically
dependent upon the availability of integrated, high quality information organized and
presented to managers in a timely and easily understood manner. Data warehouses have emerged to meet this need. Surrounded by analytical tools and models, data warehouses have the potential to transform operational data into business intelligence, enabling effective problem and opportunity identification, critical decision making, and strategy formulation, implementation, and evaluation.
Content Management:
Managing the content of a data warehouse is a daunting task. A data warehouse serves as a repository for data extracted from diverse operational information systems. These operational systems draw data from a variety of databases that operate on different hardware platforms, use different operating systems and DBMSs, and have different database structures with varying structural, conceptual, and instance-level semantics.
Major challenges remain for data warehouse content management. These include identifying and accessing the appropriate data sources and coordinating data capture from them in an appropriate timeframe.
The extraction, transformation, and loading (ETL) functions in a data warehouse are considered the most time-consuming and expensive portion of the development lifecycle.
Often such operational systems were not designed to be integrated, and data extracts are performed manually or on a schedule determined by the operational systems. As a result, data in the data warehouse may reflect different states of different systems (data extracted from an inventory system, for instance). Coordination mechanisms must therefore be established.

Integration and Design


Given that the data from varied sources have been loaded into the data warehouse,
the next set of challenges is the determination, representation, and conceptual integration
of the data that are "relevant" to the managerial decision making in an organization.
Methodologies for these tasks are in their infancy. The challenge is to integrate data from
diverse information systems in the face of organizational or economic constraints that
require those systems to remain autonomous.

Clearly the data warehouse must go beyond its current role as a repository of
historical data describing the operations and transactions in which the organization
has engaged. It must include data describing partners and partnerships, policies and
rules of the business, competitors and markets, goals and standards, opportunities
and problems, and alternatives and predicted futures.

Support
Organizations are using data warehousing to support strategic and mission-critical applications. Data deposited into the data warehouse must be transformed into information and knowledge and appropriately disseminated to decision makers within the organization and to critical partners in various supply chains. Problems that need to be addressed in this area are:
1) Selection of proper analytical and data mining tools
2) Privacy and security of data
3) System performance
4) Adequate levels of training and support

Data warehouse modelling


Data warehouse design is sometimes referred to as data warehouse modelling. A data warehouse model represents an integrated, subject-oriented, and very granular base of strategic information that serves as a single source for the decision support environment.
Data warehouse model = an abstract model, supported by graphical and lexical documentation, representing the data warehouse content that is involved in analytics applications.
Difference between the Data Warehousing and OLTP Models

Purpose and Source of Data


Data in a warehouse is subject-oriented, whereas in an OLTP environment it is transaction-oriented.
For a data warehouse, the data is consolidated data sourced from various OLTP databases; in OLTP systems the data is mainly operational.
A data warehouse uses the stored information for decision support, extended planning and problem solving, whereas OLTP uses data to execute fundamental business tasks.
Space Requirements and Processing Speeds
In OLTP environments, the processing speed is typically very fast and the storage requirements can be relatively small when compared to a warehouse.
In the warehouse model, the speed of processing depends on the amount of data involved. If complex queries or batched processes are involved, processing may run into hours, and storage requirements are also large due to the existence of aggregation structures and historical data.

OLTP applications are characterized by large numbers of short online transactions such as INSERT, UPDATE, and DELETE. OLTP emphasizes fast query processing and the maintenance of data integrity in multi-access environments, and its effectiveness is measured by the number of transactions processed per second. Data available in an OLTP database is current and is held in an entity-based model that is normalized, usually to 3NF.
Warehousing applications involve a fairly limited number of transactions, but the queries are mostly very complex, involve aggregations, and should have short response times. Warehouses host data that is historical, aggregated, and stored in multi-dimensional schemas.

Data Warehouse Modeling Approaches


Each data warehouse deployment project can take different approaches, such as:
a) Global data warehouse architecture
b) Independent data mart architecture
c) Interconnected data mart architecture
or a combination of the above architectures.

Global
A global data warehouse is designed and created based on the holistic needs of the enterprise. It can act as a common repository for decision support data across the entire enterprise. The term global in this warehouse architecture does not refer only to a centralized scheme (or a physical location); it reflects the scope and access of data across the organization. The data warehouse could also be distributed across different physical locations.
The major issue in setting up this kind of data warehouse is the time and cost involved when it spans multiple geographic locations.

Independent Data Mart


Independent data mart architecture implies stand-alone data marts that are controlled by a particular workgroup, department, or line of business and are built solely to meet its needs. There may, in fact, not even be any connectivity with data marts in other workgroups, departments, or lines of business. The data in any particular data mart is accessible only to those in the workgroup, department, or line of business that owns the data mart.

Interconnected Data Mart

Interconnected data mart architecture is basically a distributed implementation. Although separate data marts are implemented in a particular workgroup, department, or line of business, they can be integrated, or interconnected, to provide a more enterprise-wide or corporate-wide view of the data. In fact, at the highest level of integration, they can become the global data warehouse.

Approaches to Implement the Architecture


There are several approaches available to implement the above architectures. The
major approaches are: top down, bottom up, or a combination of both.

Top-Down Implementation
A top down implementation requires more planning and design work to be completed
at the beginning of the project. This brings with it the need to involve people from
each of the workgroups, departments, or lines of business that will be participating in
the data warehouse implementation. Decisions concerning data sources to be used,
security, data structure, data quality, data standards, and an overall data model will
typically need to be completed before actual implementation begins. However, the
cost of the initial planning and design can be significant. It is a time-consuming
process and can delay actual implementation, benefits, and return-on-investment.

Bottom Up Implementation
A bottom-up implementation involves the planning and designing of data marts without waiting for a more global infrastructure to be put in place. This does not mean that a more global infrastructure will not be developed; it will be built incrementally as initial data mart implementations expand. This approach is more widely accepted today than the top-down approach because immediate results from the data marts can be realized and used as justification for expanding to a more global implementation. The bottom-up implementation approach has become the choice of many organizations, especially business management, because of the faster payback.
Considerations While Choosing a Data Warehouse Modelling Approach
Incorporate several modeling techniques in a well-balanced and integrated approach.
Each of the modeling techniques that is part of the approach should have its own area of applicability.
End user communication should be well integrated and simple to interpret.
Metrics to assess the quality and feasibility of the models must be in place.

Data warehouse modeling - Techniques and guidelines


Multi-dimensional data modeling
Entity-relationship modeling
Temporal data modeling
Pattern-oriented data modeling
Data architecture modeling (Master Data Management)
The multi-dimensional data modelling technique is discussed in detail below.

OLAP (OnLine Analytical Processing)

The analysis of data across multiple dimensions is called OLAP. It provides a multi-dimensional logical view of data, allowing end users or analysts to perform forecasting, trend analysis, statistical analysis, and so on.
Multi-dimensional analysis
Analysis of data along several dimensions.
Example:
"Analysis of sales revenue by product category, store, and customer group over the last 4 quarters"
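The example query above could be sketched with pandas as follows; the column names and figures are invented purely for illustration:

import pandas as pd

sales = pd.DataFrame({
    "quarter":          ["2023Q1", "2023Q1", "2023Q2", "2023Q3"],
    "product_category": ["Books", "Games", "Books", "Games"],
    "store":            ["S1", "S2", "S1", "S2"],
    "customer_group":   ["Retail", "Retail", "Corporate", "Retail"],
    "revenue":          [120.0, 80.0, 200.0, 150.0],
})

# Revenue by product category, store and customer group, per quarter.
report = sales.pivot_table(
    values="revenue",
    index=["product_category", "store", "customer_group"],
    columns="quarter",
    aggfunc="sum",
    fill_value=0,
)
print(report)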

Some Basic Concepts of Data Modelling


1) Fact: A fact is a collection of related measures plus their associated dimensions, represented by dimension keys.
Facts contain:
Dimension keys - each dimension key is a reference to a dimension at its grain.
Measures and supportive measures.
2) Dimension: A dimension is a collection of members or units of the same type or view. A dimension provides a certain business context to each measure. Common dimensions are:
Time
Location/region
Customers
Salesperson
Grain of a dimension: The grain of a dimension is the lowest level of detail available within that dimension.
3) Measure: A measure is a numeric attribute of a fact, representing the performance or behaviour of the business relative to the dimensions. A measure is a data item used by end users in their business queries to measure the performance or behaviour of a business process or of a business object. The measure focuses on what is being evaluated.
Granularity of a measure: The granularity of a measure is determined by the combination of the grains of all its dimensions.
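Putting these concepts together, a single fact at a given grain might be represented informally as follows (all names and values are illustrative):

# One fact at the grain "one row per product, per store, per day".
fact_sale = {
    # dimension keys, each referencing a member of a dimension at its grain
    "time_key":    20240315,   # Time dimension, grain = day
    "store_key":   42,         # Location dimension, grain = store
    "product_key": 1007,       # Product dimension, grain = product
    # measures: numeric attributes describing business performance
    "quantity_sold": 3,
    "revenue":       59.97,
}
# The granularity of the measures is the combination of the grains of the
# dimensions: here, revenue per product, per store, per day.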

Basic OLAP Operations

1) Drill-down and roll-up
Drill-down - exploring facts at more detailed levels.
Roll-up - aggregating facts at less detailed levels.
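A small pandas sketch of the two operations on a hypothetical Time hierarchy (day -> month); column names are invented:

import pandas as pd

sales = pd.DataFrame({
    "day":     ["2024-01-05", "2024-01-20", "2024-02-03"],
    "month":   ["2024-01", "2024-01", "2024-02"],
    "revenue": [100.0, 150.0, 200.0],
})

rolled_up    = sales.groupby("month")["revenue"].sum()           # less detail
drilled_down = sales.groupby(["month", "day"])["revenue"].sum()  # more detail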

2) Slice and dice
Slice and dice are the operations for browsing the data through the visualized cube. Slicing cuts through the cube by fixing a value for one dimension, so that users can focus on a specific perspective. Dicing restricts several dimensions at once, selecting a sub-cube, so that users can be more specific in their data analysis.
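A small pandas sketch of slicing and dicing a hypothetical cube held as a flat table; names and values are illustrative:

import pandas as pd

cube = pd.DataFrame({
    "year":    [2023, 2023, 2024, 2024],
    "region":  ["EU", "US", "EU", "US"],
    "product": ["A", "A", "B", "B"],
    "revenue": [10.0, 20.0, 30.0, 40.0],
})

# Slice: fix one dimension to a single value.
slice_2024 = cube[cube["year"] == 2024]

# Dice: restrict several dimensions at once to obtain a sub-cube.
dice_eu_ab = cube[(cube["region"] == "EU") & (cube["product"].isin(["A", "B"]))]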

Dimensional modeling gives us an improved capability to visualize the very abstract questions that business end users are required to answer. Using dimensional modeling, end users can easily understand and navigate the data structure and fully exploit the data. To create a data model, we must first understand the business process. We capture the requirements, analyse them, and validate them using an initial model, i.e. we assess the feasibility and the efficiency of the model. Once the validation of the requirements is done, a detailed dimensional model is developed. The detailed dimensional model can then be further extended and optimized.

Requirement Analysis
Requirement analysis is used to build an initial dimensional model that represents the end user requirements, which were previously captured in an informal way. The output of this phase acts as an input for the requirements modeling activities, once it has passed the requirements validation phase. The deliverables of this phase consist of a combination of:
Initial dimensional data models
Business directory or metadata definitions of all elements of the multi-dimensional model
The end user requirements can be classified into two major categories:
Process-oriented requirements: These represent the major information processing elements which the end users are performing. Process-oriented requirements may be business objectives or business queries.
Information-oriented requirements: These represent the major data items which the end users require for their data analysis activities.
The ultimate scope of requirement analysis can be summarized as:
Gather and interpret business requirements and formulate business questions.
Candidate measures, facts, and dimensions are determined.
Grains of dimensions and granularities of measures and facts are determined.
Dimension hierarchies and aggregation levels are determined.
An initial multi-dimensional model is developed.
Business directory definitions are established for the model.
There are several approaches to capturing the requirements and their artefacts. The most common approaches are:
Query-oriented approach: Measures and their associated dimensions are determined first. Then, the facts are established. This follows the natural query-oriented approach by picking the end user queries as the first source of information.
Business-oriented approach: This approach tries to capture the fundamental elements of the business problem. First, the facts are determined through analysis of the problem domain from the business point of view. Then, the dimensions and measures are added to the model.
Data source oriented approach: This approach focuses on the source database models to determine the dimensions, followed by the measures and facts.

Requirements modelling
After the requirements have been validated, they can be represented as a model. The model can be an initial multi-dimensional model, a concrete model represented using cubes, or a mathematical notation representing points in a multi-dimensional space. These representations may be appealing, especially cubes, but their complexity increases exponentially as the dimensionality increases. For simplicity we'll keep the model as a cubical dimensional model.

The requirements modelling activities can be divided into two broad groups:
(i) Base techniques - used for producing the logical models for the dimensions in the initial model. These dimension modeling techniques involve:
Adding dimension attributes which aid in selecting the relevant facts
Dimension browsing - exploring the dimension to detect and set the appropriate selection and aggregation constraints used in subsequent analysis of facts
Once the dimension attributes and facts are gathered, a detailed dimension model is prepared.
(ii) Detailed dimension modeling, which should incorporate the structure of the dimension as well as all of its attributes.

The proposed approach for modeling the dimensions consists of the following activities for each dimension hierarchy:
Create an entity for each of the aggregation levels within the hierarchy and add identifiers for each of the dimension entities.
Link the entities in a hierarchical structure and add the required attributes to each dimension entity (those useful, relevant, or requested by end users).
Demote aggregation levels which do not have any associated attributes from dimension entities into dimension attributes.
This kind of approach leads to the so-called snowflake models, because it supports standardizing the dimension hierarchies and aggregation levels.

Considerations for building detailed dimension modeling


Dimension hierarchies identified on the basis of end user requirements may not be the best. Other dimension hierarchies may be identified during requirements modeling.
The structure of the dimension hierarchies identified earlier must not be taken for granted. When modeling dimensions, new structures may be identified and proposed to end users as better solutions.
All the dimension hierarchies and aggregation levels in the hierarchies should be standardized to achieve maximum consistency of information analysis.
Additionally, it is recommended that the model be verified for typical relationships that exist within the dimensions. There are three important types of relationships which appear in multi-dimensional data models:
Classification relationships: Classifications are arrangements of like objects into groups or classes. Example - sales items or products may be classified according to sales-oriented properties and also according to manufacturing- and stock-oriented characteristics.
Construction relationships: Relationships represented in the form of a bill of materials and used by information analysts to explore the construction relationships between objects and their parts fall into this category. Example - calculate the cost of a product using the cost associated with the product's components and the cost of constructing the product.
Variation relationships: These are used to differentiate objects in terms of models, versions, implementations, etc. Example - a version of a product is sold to customers when the original item is not available.
All these types of relationships are candidates for being used in the dimension models, but exactly how many of them should be used in a model is determined by the judgement of the modeler.

Multi-Dimensional Model Structures

There are two basic models that can be used in dimensional modeling:
Star model
Snowflake model

Star Schema
Star schema has become a common term used to connote a dimensional model. Database designers have long used the term star schema to describe dimensional models because the resulting structure looks like a star and the logical diagram looks like the physical schema. Each of the dimension tables is a denormalized construct which holds all the attributes of all the aggregation levels in all of the hierarchies of a given dimension. Characteristics of the star model:
It is a very pragmatic approach.
The model looks very much like the way the user thinks.
It is easier to use.
Querying is usually very efficient.
It is best suited for most M-OLAP tools.
A minimal schema sketch is given below.
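A minimal star-schema sketch in SQLite (via Python's sqlite3); all table and column names are illustrative, not a prescribed design:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Denormalized dimension: every aggregation level kept in one table.
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category     TEXT,      -- higher aggregation level, held in the same table
    department   TEXT
);
CREATE TABLE dim_time (
    time_key INTEGER PRIMARY KEY,
    day TEXT, month TEXT, quarter TEXT, year INTEGER
);
CREATE TABLE fact_sales (
    time_key    INTEGER REFERENCES dim_time(time_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity    INTEGER,
    revenue     REAL
);
""")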

Snowflake Schema
This is a representation of a multidimensional data model in which the dimension hierarchies are structured and normalized. Because the schema holds normalized data, redundancy is minimal compared to a star schema. This model is highly useful in situations where the dimensions are very complex, and it offers increased modeling and design flexibility. The corresponding normalized layout is sketched below.
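The same Product dimension, snowflaked into one table per aggregation level (again a SQLite sketch with illustrative names only):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_department (department_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_category (
    category_key   INTEGER PRIMARY KEY,
    name           TEXT,
    department_key INTEGER REFERENCES dim_department(department_key)
);
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    name         TEXT,
    category_key INTEGER REFERENCES dim_category(category_key)
);
-- The fact table still references only the lowest level (product).
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    revenue     REAL
);
""")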
Hints and Tips While Making a Multi-Dimensional Model
Properties of measures
Business-related facts
Fact identifiers, dimension keys and uniqueness
Dimension roles

(i) Properties of measures
Measures can be classified into three major categories:
Additive measures - measures which can be added across all of their dimensions. Example: Quantity_Changed, Rental_Amount, etc.
Semi-additive measures - measures which can only be added across some of their dimensions, depending on the aggregation that is made in the analysis. Example: Total_Quantity_Available (ignoring the Time dimension).
Non-additive measures - measures which cannot be added across any of their dimensions. Example: ratios like (Total_Sales/Average_Quantity_On_Hand).
The difference is illustrated in the sketch after this list.
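A small pandas sketch of why an additive measure can simply be summed while a non-additive ratio must be recomputed from its components; the figures are invented:

import pandas as pd

inventory = pd.DataFrame({
    "store":           ["S1", "S2"],
    "total_sales":     [100.0, 300.0],   # additive across stores
    "avg_qty_on_hand": [50.0, 50.0],     # semi-additive (not across Time)
})
inventory["sales_per_unit"] = inventory["total_sales"] / inventory["avg_qty_on_hand"]

# Additive: summing across the Store dimension is meaningful.
total_sales_all_stores = inventory["total_sales"].sum()            # 400.0

# Non-additive: the ratio must be recomputed from its components,
# not summed (2.0 + 6.0 = 8.0 would be wrong; the correct value is 4.0).
correct_ratio = inventory["total_sales"].sum() / inventory["avg_qty_on_hand"].sum()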
(ii) Business-related facts
Facts can represent:
a business transaction or a business event (example: a sale, representing what was bought, where and when the sale took place, who bought the item, how much was paid for the item sold, possible discounts involved in the sale, etc.);
the state of a given business object (example: the inventory state, representing what is stored where and how much of it was stored during a given period);
changes to the state of a given business object (example: inventory changes, representing item movements from one inventory to another and the amount involved in the move, etc.).
(iii) Fact identifiers, dimension keys and uniqueness
It is recommended to assign a unique identifier to all the facts in the model, i.e. a primary key for each fact, and to have the identification value assigned automatically during the populating process.
(iv) Dimension roles
A fact can have several dimension keys which are basically different roles of the same dimension, as shown in the example below.
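For example (with purely illustrative names), an order fact might reference the Time dimension twice, once per role:

# Dimension roles: the same Time dimension is referenced twice by one fact,
# once as the order date and once as the shipping date.
order_fact = {
    "order_date_key": 20240301,   # role 1 of the Time dimension
    "ship_date_key":  20240305,   # role 2 of the same Time dimension
    "product_key":    1007,
    "amount":         250.0,
}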

Solution Validation Techniques


Once the requirements analysis is done, the requirements are mapped to a multi-dimensional data model. This model must be validated against the end user requirements, and the candidate data sources to which the model has been mapped are identified and accounted for. The base activities involved in assessing the initial data model are to check for:
Validity of the model - does the model functionally cover the end users' requirements and is it consistent with models of the business?
Feasibility
o Is the required source data available and can it be gathered?
o Hot-spot analysis of the populating process, including a performance assessment.
Efficiency
o Hot-spot analysis of potential query performance problems.
o Sizing of fact entities and dimension structures.
