Requirements gathering:
Gather accurate requirements by data analysis and functional analysis.
Development of data model:
Create standard abbreviation document for logical, physical and dimensional data models.
Create logical, physical and dimensional data models (data warehouse data modelling).
Document logical, physical and dimensional data models (data warehouse data modelling).
Reports:
Generate reports from data model.
Review:
Review the data model with functional and technical team.
Creation of database:
Create SQL code from the data model and coordinate with DBAs to create the database.
Check that data models and databases are in sync (a query sketch follows this list).
Support & Maintenance:
Assist developers, ETL, BI team and end users to understand the data model.
Maintain change log for each data model.
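To illustrate the sync check mentioned above, here is a minimal sketch that lists a schema's actual columns from the standard information_schema catalog views (assuming the DBMS exposes them, as most major RDBMSs do); the schema name 'sales_dw' is hypothetical. The output can be diffed against the column list exported from the data modeling tool.

    -- List the deployed columns for a hypothetical schema 'sales_dw'.
    SELECT table_name, column_name, data_type
    FROM   information_schema.columns
    WHERE  table_schema = 'sales_dw'
    ORDER BY table_name, ordinal_position;
    -- Any table or column present here but missing from the model (or vice
    -- versa) indicates the model and the database are out of sync.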
Steps to create a Data Model
These are general guidelines for creating a standard data model; in practice, a data model
may not be created in the same sequential manner as shown below. Based on the enterprise's
requirements, some of these steps may be excluded, or others included in addition to them.
Sometimes a data modeler may be asked to develop a data model based on an existing database.
In that situation, the data modeler has to reverse engineer the database and create a data model.
1 Get Business requirements.
2 Create High Level Conceptual Data Model.
3 Create Logical Data Model.
4 Select the target DBMS where the data modeling tool will create the physical schema.
5 Create a standard abbreviation document according to business standards.
6 Create domain.
7 Create Entity and add definitions.
8 Create attribute and add definitions.
9 Based on the analysis, try to create surrogate keys, super types and sub types.
10 Assign a datatype to each attribute. If a domain is already present, the attribute should be attached to the domain.
11 Create primary or unique keys for attributes.
12 Create check constraints or defaults for attributes.
13 Create unique indexes or bitmap indexes for attributes.
14 Create foreign key relationships between entities.
15 Create Physical Data Model.
16 Add database properties to the physical data model.
17 Create SQL scripts from the Physical Data Model and forward them to the DBA (a sketch of steps 7-14 in SQL appears after this list).
18 Maintain the Logical & Physical Data Models.
19 For each release (version of the data model), compare the present version with the previous version of the data model. Similarly, compare the data model with the database to find out the differences.
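As referenced in step 17, here is a hedged sketch of what the generated SQL for steps 7-14 might look like, using hypothetical CUSTOMER and ORDERS entities; all names and datatypes are illustrative assumptions, not output from any particular modeling tool.

    -- Entities and attributes, with definitions captured as comments (steps 7-8).
    CREATE TABLE customer (
        customer_id   INTEGER      NOT NULL,     -- surrogate key (step 9)
        customer_name VARCHAR(100) NOT NULL,     -- attribute with datatype (steps 8, 10)
        customer_type CHAR(1)      DEFAULT 'R',  -- default value (step 12)
        CONSTRAINT pk_customer PRIMARY KEY (customer_id),   -- primary key (step 11)
        CONSTRAINT ck_customer_type
            CHECK (customer_type IN ('R', 'W'))             -- check constraint (step 12)
    );

    CREATE TABLE orders (
        order_id    INTEGER NOT NULL,
        customer_id INTEGER NOT NULL,
        order_date  DATE    NOT NULL,
        CONSTRAINT pk_orders PRIMARY KEY (order_id),
        CONSTRAINT fk_orders_customer
            FOREIGN KEY (customer_id) REFERENCES customer (customer_id)  -- foreign key (step 14)
    );

    -- Unique index (step 13); in Oracle, a bitmap index would use CREATE BITMAP INDEX instead.
    CREATE UNIQUE INDEX ux_customer_name ON customer (customer_name);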
A logical data model is developed by clearly thinking about the current and future business requirements. The logical data
model includes all required entities, attributes, key groups, and relationships
that represent business information and define business rules.
In the example, we have identified the entity names, attribute names, and
relationships. For a detailed explanation, refer to relational data modeling.
Extract, transform, and load (ETL) in database usage, and especially in data warehousing,
involves:
extracting data from outside sources;
transforming it to fit operational needs;
loading it into the end target (database or data warehouse).
The advantages of efficient and consistent databases make ETL very important, as it is the way data
actually gets loaded.
This article discusses ETL in the context of a data warehouse, whereas the term ETL can in fact
refer to a process that loads any database.
The typical real-life ETL cycle consists of the following execution steps:
1. Cycle initiation
2. Build reference data
3. Extract (from sources)
4. Validate
5. Transform (clean, apply business rules, check for data integrity, create aggregates)
6. Stage (load into staging tables, if used)
7. Audit reports (for example, on compliance with business rules; also, in case of
failure, these help to diagnose/repair)
8. Publish (to target tables)
9. Archive
10. Clean up
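As a concrete illustration, here is a hedged, set-based SQL sketch of steps 3-8 of the cycle; every table and column name (src_sales, stg_sales, fact_sales, and so on) is a hypothetical assumption, not a standard.

    -- Steps 3 and 6: extract from the source and land the rows in a staging table.
    INSERT INTO stg_sales (sale_id, sale_date, store_code, amount)
    SELECT sale_id, sale_date, store_code, amount
    FROM   src_sales;                 -- in practice often a flat file or DB link

    -- Step 4: validate; quarantine rows that break basic rules
    -- (stg_sales_rejects is assumed to have the same layout as stg_sales).
    INSERT INTO stg_sales_rejects
    SELECT * FROM stg_sales
    WHERE  amount < 0 OR sale_date IS NULL;

    DELETE FROM stg_sales
    WHERE  amount < 0 OR sale_date IS NULL;

    -- Step 5: transform; standardize codes so they match the reference data (step 2).
    UPDATE stg_sales
    SET    store_code = UPPER(TRIM(store_code));

    -- Step 8: publish the validated, transformed rows to the target table.
    INSERT INTO fact_sales (sale_id, sale_date, store_code, amount)
    SELECT sale_id, sale_date, store_code, amount
    FROM   stg_sales;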
Staging Area
The staging area is the place where all transformation, cleansing, and enrichment is done before
data can flow further.
The data is extracted from the source system by various methods (typically called extraction)
and is placed in normalized form into the staging area. Once in the staging area, data is
cleansed, standardized, and re-formatted to make it ready for loading into the data warehouse
loaded area. We are going to cover the broad details here. The details of staging can be referred
to in Data Extraction and Transformation Design in Data Warehouse.
The staging area is important not only for data warehousing, but for a host of other applications as
well. Therefore, it has to be seen from a wider perspective. Staging is an area where sanitized,
integrated, and detailed data in normalized form exists.
With the advent of the data warehouse, the concept of transformation has gained ground, which
provides a high degree of quality and uniformity to the data. The conventional (pre-data
warehouse) staging areas used to be plain dumps of the production data. Therefore, a staging
area with extraction and transformation is the best of both worlds for generating quality
transaction-level information.
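A minimal sketch of such a normalized staging table follows; the names are hypothetical. The point is that it is detailed (transaction grain), integrated (one row format regardless of source system), and carries audit columns to support cleansing and standardization.

    -- Hypothetical normalized staging table for customer records.
    CREATE TABLE stg_customer (
        source_system  VARCHAR(20)  NOT NULL,   -- which source this row came from
        source_key     VARCHAR(50)  NOT NULL,   -- key of the row in the source system
        customer_name  VARCHAR(100),
        city           VARCHAR(50),
        load_timestamp TIMESTAMP    NOT NULL,   -- when the row landed in staging
        PRIMARY KEY (source_system, source_key, load_timestamp)
    );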
DW vs. Data Mart
Data Warehouse
A Data Warehouse is the area where the information is loaded in de-normalized dimensional
modeling form. This subject has been dealt with in a fair degree of detail in the Data Warehousing/Marting
section. A Data Warehouse is a repository of data which contains data in a de-normalized
dimensional form ACROSS the enterprise. Following are the features of a Data Warehouse:
A Data-Warehouse is the source for most of the end user tools for Data
Analysis, Data Mining, and strategic planning.
It contains uniform & standard dimensions and measures. The details of this
can be referred to Dimensional Modeling Concepts.
It contains only actuals data: this is linked to 'read-only'. As a best practice,
all the non-actual data (like standards, future projections, what-if scenarios) should be
managed and maintained in OLAP and end-user tools.
Data Marts
A data mart is a smaller, specific-purpose-oriented data warehouse. A data warehouse is a big,
strategic platform, which needs considerable planning. The difference between a data warehouse and
a data mart is like that of planning a city vs. planning a township. A data warehouse is a medium-to-long-term
effort to integrate and create a single point system of record for virtually all applications
and needs for data. A data mart is a short-to-medium-term effort to build a repository for a specific
analysis. The differences between a data warehouse and a data mart are as follows:
Scope: a Data Warehouse is domain independent; a Data Mart serves a specific application and a specific domain.
Development: a Data Warehouse is centralized and planned; a Data Mart is independent.
Data: historical, detailed & summarized. A good data warehouse will capture the history of transactions by default, even if there is no immediate need. This is because a data warehouse always tries to be future-proof.
Sources: many internal & external sources. This is an obvious outcome of the Data Warehouse being a generic resource; that is also the reason why the staging design for a data warehouse takes much more time compared to that of a data mart.
Life Cycle: stand-alone strategic initiative with a long life.
2. Data Contents
OLTP: manages current data that, typically, are too detailed to be easily used for decision making.
OLAP: manages large amounts of historical data, provides facilities for summarization and
aggregation, and stores information at different levels of granularity to support the decision-making
process.
3. Database Design
OLTP: adopts an entity relationship(ER) model and an application-oriented database design.
OLAP: adopts star, snowflake or fact constellation model and a subject-oriented database design.
4. View
OLTP: focuses on the current data within an enterprise or department.
OLAP: spans multiple versions of a database schema due to the evolutionary process of an
organization; integrates information from many organizational locations and data stores
ETL tools are used to extract, transform, and load the data into the data
warehouse / data mart.
OLAP tools are used to create cubes/reports for business analysis from the data
warehouse / data mart.
An ETL tool is used to extract the data and to perform operations as per our needs
(e.g., Informatica loading a data mart), but OLAP is completely different from the ETL process: it is
used for generating reports, and such tools are also known as reporting tools (e.g., BO and Cognos).
What is BI?
Business Intelligence is a term introduced by Howard Dresner of Gartner Group in
1989. He described Business Intelligence as a set of concepts and methodologies to
improve decision making in business through the use of facts and fact-based systems.
What is aggregation?
In the data warehouse paradigm, "aggregation" is one way of improving query
performance. An aggregate fact table is a new table created off an existing fact
table by summing up facts for a set of associated dimensions. The grain of an aggregate
fact table is higher than that of the base fact table. Aggregate tables contain fewer rows,
thus making queries run faster.
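A minimal sketch of building such an aggregate follows, assuming a hypothetical base fact table fact_sales at day/product grain and a dim_date dimension: the aggregate rolls the grain up to month/product, so it holds far fewer rows and month-level queries can read it instead of the base fact.

    -- Build a month/product aggregate from a day-grain fact table.
    CREATE TABLE agg_sales_month_product AS
    SELECT f.product_key,
           d.calendar_month,                    -- grain rolled up from day to month
           SUM(f.sales_amount) AS sales_amount,
           SUM(f.units_sold)   AS units_sold
    FROM   fact_sales f
    JOIN   dim_date d ON d.date_key = f.date_key
    GROUP BY f.product_key, d.calendar_month;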
What are the different approaches for making a Datawarehouse?
This is a generic question: From a business perspective, it is very important to first
get clarity on the end user requirements and a system study before commencing
any Data warehousing project. From a technical perspective, it is important to first
understand the dimensions and measures, determine quality and structure of
source data from the OLTP systems and then decide which dimensional model to
apply, i.e. whether we do a star or snowflake or a combination of both. From a
conceptual perspective, we can either go the Ralph Kimball method (build data
marts and then consolidate at the end to form an enterprise data warehouse) or the
Bill Inmon method (build a large data warehouse and derive data marts from the
same). In order to decide on the method, a strong understanding of the business
requirements and data structures is needed, as well as consensus with the customer.
What is staging area?
The staging area is also called the Operational Data Store (ODS). It is a data holding place
where the data extracted from all the data sources is stored. From the
staging area, data is loaded into the data warehouse. Data cleansing takes place in
this stage.
What is the difference between star and snowflake schema?
The main difference between star schema and snowflake schema is that the star
schema is highly denormalized and the snowflake schema is normalized. So the
data access latency is less in star schema in comparison to snowflake schema. As
the star schema is denormalized, the size of the data warehouse will be larger than
that of snowflake schema. The schemas are selected as per the client requirements.
Performance-wise, the star schema is good. But if memory utilization is a major concern,
then the snowflake schema is better than the star schema.
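To make the difference concrete, here is a hedged sketch of the same product dimension modeled both ways; all table and column names are hypothetical.

    -- Star schema: one wide, denormalized dimension table.
    CREATE TABLE dim_product_star (
        product_key     INTEGER PRIMARY KEY,
        product_name    VARCHAR(100),
        category_name   VARCHAR(50),   -- hierarchy levels repeated on every row
        department_name VARCHAR(50)
    );

    -- Snowflake schema: the hierarchy levels split into their own tables.
    CREATE TABLE dim_department (
        department_key  INTEGER PRIMARY KEY,
        department_name VARCHAR(50)
    );
    CREATE TABLE dim_category (
        category_key   INTEGER PRIMARY KEY,
        category_name  VARCHAR(50),
        department_key INTEGER REFERENCES dim_department (department_key)
    );
    CREATE TABLE dim_product_snow (
        product_key  INTEGER PRIMARY KEY,
        product_name VARCHAR(100),
        category_key INTEGER REFERENCES dim_category (category_key)
    );
    -- Star queries need fewer joins (lower latency); the snowflake stores each
    -- category/department name only once (less space), matching the trade-off above.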
What is Data mart?
A data mart is a subset of an organizational data store, usually oriented to a
specific purpose or major data subject, that may be distributed to support business
needs.[1] Data marts are analytical data stores designed to focus on specific
business functions for a specific community within an organization. Data marts are
often derived from subsets of data in a data warehouse, though in the bottom-up
data warehouse design methodology the data warehouse is created from the union
of organizational data marts.
Information in the data warehouse is under the control of data warehouse users so
that, even if the source system data is purged over time, the information in the warehouse
can be stored safely for extended periods of time.
Because they are separate from operational systems, data warehouses provide
retrieval of data without slowing down operational systems.
Data warehouses can work in conjunction with and, hence, enhance the value of
operational business applications, notably customer relationship management (CRM)
systems.
Cube can (and arguably should) mean something quite specific - OLAP artifacts presented
through an OLAP server such as MS Analysis Services or Oracle (nee Hyperion) Essbase.
However, it also gets used much more loosely. OLAP cubes of this sort use cube-aware query
tools which use a different API to a standard relational database. Typically OLAP servers
maintain their own optimized data structures (known as MOLAP), although they can be
implemented as a front-end to a relational data source (known as ROLAP) or in various hybrid
modes (known as HOLAP).
I try to be specific and use 'cube' specifically to refer to cubes on OLAP servers such as SSAS.
Business Objects works by querying data through one or more sources (which could be relational
databases, OLAP cubes, or flat files) and creating an in-memory data structure called a
MicroCube which it uses to support interactive slice-and-dice activities. Analysis Services and
MSQuery can make a cube (.cub) file which can be opened by the AS client software or Excel
and sliced-and-diced in a similar manner. IIRC, recent versions of Business Objects can also
open .cub files.
To be pedantic I think Business Objects sits in a 'semi-structured reporting' space somewhere
between a true OLAP system such as ProClarity and ad-hoc reporting tool such as Report
Builder, Oracle Discoverer or Brio. Round trips to the Query Panel make it somewhat clunky as
a pure stream-of-thought OLAP tool but it does offer a level of interactivity that traditional
reports don't. I see the sweet spot of Business Objects as sitting in two places: ad-hoc reporting
by staff not necessarily familiar with SQL, and providing a scheduled report delivered in an
interactive format that allows some drill-down into the data.
'Data Mart' is also a fairly loosely used term and can mean any user-facing data access medium
for a data warehouse system. The definition may or may not include the reporting tools and
metadata layers, reporting layer tables or other items such as Cubes or other analytic systems.
I tend to think of a data mart as the database from which the reporting is done, particularly if it is
a readily definable subsystem of the overall data warehouse architecture. However it is quite
reasonable to think of it as the user facing reporting layer, particularly if there are ad-hoc
reporting tools such as Business Objects or OLAP systems that allow end-users to get at the data
directly. The term "data mart" has become somewhat ambiguous, but it is traditionally associated
with a subject-oriented subset of an organization's information systems. Data mart does not
explicitly imply the presence of a multi-dimensional technology such as OLAP and data mart
does not explicitly imply the presence of summarized numerical data.
A cube, on the other hand, tends to imply that data is presented using a multi-dimensional
nomenclature (typically an OLAP technology) and that the data is generally summarized as
intersections of multiple hierarchies. (i.e. the net worth of your family vs. your personal net
worth and everything in between) Generally, cube implies something very specific whereas
data mart tends to be a little more general.
I suppose in OOP speak you could accurately say that a data mart has-a cube, has-a
relational database, has-a nifty reporting interface, etc but it would be less correct to say that
any one of those individually is-a data mart. The term data mart is more inclusive.
[Figure 1-1: Contrasting OLTP and Data Warehousing Environments]
Online transaction processing. OLTP systems are optimized for fast and reliable transaction
handling. Compared to data warehouse systems, most OLTP interactions will involve a relatively
small number of rows, but a larger group of tables.
So I think that PL/SQL could be your ETL solution if you have no other tools available.
Another approach might be writing procedures, packages and functions that may be used by an
ETL tool. This is usually done when complicated transformations cannot be efficiently
implemented in the ETL.
As you can see, PL/SQL fits well into any ETL process.
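As an illustration, here is a hedged sketch of the kind of set-based statement such a hand-coded procedure might wrap: a type-1 refresh of a customer dimension from staging using MERGE (available in Oracle and most modern RDBMSs); the table and column names are hypothetical.

    -- Upsert the customer dimension from a staging table (type-1 refresh).
    MERGE INTO dim_customer d
    USING stg_customer s
    ON (d.customer_nk = s.customer_nk)          -- match on the natural key
    WHEN MATCHED THEN
        UPDATE SET d.customer_name = s.customer_name,
                   d.city          = s.city     -- overwrite changed attributes
    WHEN NOT MATCHED THEN
        INSERT (customer_nk, customer_name, city)
        VALUES (s.customer_nk, s.customer_name, s.city);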
Functional capability: This includes both the 'transformation' piece and the
'cleansing' piece. In general, the typical ETL tools are either geared towards having
strong transformation capabilities or having strong cleansing capabilities, but they
are seldom very strong in both. As a result, if you know your data is going to be dirty
coming in, make sure your ETL tool has strong cleansing capabilities. If you know
there are going to be a lot of different data transformations, it then makes sense to
pick a tool that is strong in transformation.
Ability to read directly from your data source: For each organization,
there is a different set of data sources. Make sure the ETL tool you select can connect
directly to your source data.
Metadata support: The ETL tool plays a key role in your metadata because
it maps the source data to the destination, which is an important piece of the
metadata. In fact, some organizations have come to rely on the documentation of
their ETL tool as their metadata source. As a result, it is very important to select an
ETL tool that works with your overall metadata strategy.
OLAP tools typically come with a 'Designer' piece, where the data warehouse administrator can specify
the relationship between the relational tables, as well as how dimensions, attributes, and hierarchies
map to the underlying database tables.
Right now, there is a convergence between the traditional ROLAP and MOLAP vendors.
ROLAP vendors recognize that users want their reports fast, so they are implementing MOLAP
functionalities in their tools; MOLAP vendors recognize that many times it is necessary to
drill down to the most detailed level of information, levels where the traditional cubes do not
reach for performance and size reasons.
So what are the criteria for evaluating OLAP vendors? Here they are:
Ability to leverage parallelism supplied by RDBMS and hardware: This would
greatly increase the tool's performance, and help load the data into the cubes as quickly
as possible.
Performance: In addition to leveraging parallelism, the tool itself should be quick both in
terms of loading the data into the cube and reading the data from the cube.
Customization efforts: More and more, OLAP tools are used as an advanced reporting
tool. This is because in many cases, especially for ROLAP implementations, OLAP tools often
can be used as a reporting tool. In such cases, the ease of front-end customization becomes
an important factor in the tool selection process.
Security Features: Because OLAP tools are geared towards a number of users, making
sure people see only what they are supposed to see is important. By and large, all
established OLAP tools have a security layer that can interact with the common corporate
login protocols. There are, however, cases where large corporations have developed their
own user authentication mechanism and have a "single sign-on" policy. For these cases,
having a seamless integration between the tool and the in-house authentication can require
some work. I would recommend that you have the tool vendor team come in and make sure
that the two are compatible.
Metadata support: Because OLAP tools aggregate the data into the cube and
sometimes serve as the front-end tool, it is essential that they work with the metadata
strategy/tool you have selected.
Popular Tools:
Business Objects
Cognos
Hyperion
MicroStrategy
Conceptual Data Model
Features of the conceptual data model include:
No attribute is specified.
At this level, the data modeler attempts to identify the highest-level relationships among the
different entities.
Logical Data Model
Features of logical data model include:
Foreign keys (keys identifying the relationship between different entities) are
specified.
At this level, the data modeler attempts to describe the data in as much detail as possible,
without regard to how they will be physically implemented in the database.
In data warehousing, it is common for the conceptual data model and the logical data model
to be combined into a single step (deliverable).
The logical data model is designed by identifying the required entities, their attributes, the key groups, and the relationships among them.
Dimensional Model: A type of data modeling suited for data warehousing. In a dimensional
model, there are two types of tables: dimensional tables and fact tables. Dimensional tables
record information on each dimension, and fact tables record all the "facts," or measures.
Dimensional Table: Dimension tables store records related to a particular dimension. No
facts are stored in a dimensional table.
ETL: Stands for Extraction, Transformation, and Loading. The movement of data from one
area to another.
Fact Table: A type of table in the dimensional model. A fact table typically includes two
types of columns: fact columns and foreign keys to the dimensions.
Hierarchy: A hierarchy defines the navigating path for drilling up and drilling down. All
attributes in a hierarchy belong to the same dimension.
Metadata: Data about data. For example, the number of tables in the database is a type of
metadata.
Metric: A measured value. For example, total sales is a metric.
MOLAP: Multidimensional OLAP. MOLAP systems store data in the multidimensional cubes.
OLAP: On-Line Analytical Processing. OLAP should be designed to provide end users a quick
way of slicing and dicing the data.
ROLAP: Relational OLAP. ROLAP systems store data in the relational database.
Snowflake Schema: A common form of dimensional model. In a snowflake schema,
different hierarchies in a dimension can be extended into their own dimensional tables.
Therefore, a dimension can have more than a single dimension table.
Star Schema: A common form of dimensional model. In a star schema, each dimension is
represented by a single dimension table.
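Tying the Hierarchy and Metric entries above together, here is a hedged sketch with hypothetical fact_sales and dim_date tables: drilling down is simply computing the same metric at a finer level of the hierarchy.

    -- Total sales at the year level of the date hierarchy.
    SELECT d.calendar_year, SUM(f.sales_amount) AS total_sales
    FROM   fact_sales f
    JOIN   dim_date d ON d.date_key = f.date_key
    GROUP BY d.calendar_year;

    -- Drill down: the same metric one level lower (year -> month).
    SELECT d.calendar_year, d.calendar_month, SUM(f.sales_amount) AS total_sales
    FROM   fact_sales f
    JOIN   dim_date d ON d.date_key = f.date_key
    GROUP BY d.calendar_year, d.calendar_month;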
The following are the typical processes involved in the data warehousing project cycle.
Requirement Gathering
Data Modeling
ETL
Performance Tuning
Quality Assurance
Production Maintenance
Incremental Enhancements
This is a very important step in the data warehousing project. Indeed, it is fair to say that the
foundation of the data warehousing system is the data model. A good data model will allow
the data warehousing system to grow easily, as well as allowing for good performance.
In a data warehousing project, the logical data model is built based on user requirements, and
then it is translated into the physical data model. The detailed steps can be found in the
Conceptual, Logical, and Physical Data Modeling section.
Part of the data modeling exercise is often the identification of data sources. Sometimes this
step is deferred until the ETL step. However, my feeling is that it is better to find out where
the data exists, or, better yet, whether they even exist anywhere in the enterprise at all.
Should the data not be available, this is a good time to raise the alarm. If this is delayed
until the ETL phase, rectifying it becomes a much tougher and more complex process.
The ETL (Extraction, Transformation, Loading) process typically takes the longest to develop,
and this can easily take up to 50% of the data warehouse implementation cycle or longer.
The reason for this is that it takes time to get the source data, understand the necessary
columns, understand the business rules, and understand the logical and physical data
models.
Possible Pitfalls
There is a tendency to give this particular phase too little development time. This can prove
suicidal to the project because end users will usually tolerate less formatting, longer time to
run reports, less functionality (slicing and dicing), or fewer delivered reports; one thing that
they will not tolerate is wrong information.
A second common problem is that some people make the ETL process more complicated
than necessary. In ETL design, the primary goal should be to optimize load speed without
sacrificing on quality. This is, however, sometimes not followed. There are cases where the
design goal is to cover all possible future uses, whether they are practical or just a figment
of someone's imagination. When this happens, ETL performance suffers, and often so does
the performance of the entire data warehousing system.
There are three major areas where a data warehousing system can use a little
performance tuning:
ETL - Given that the data load is usually a very time-consuming process (and
hence is typically relegated to a nightly load job) and that data warehousing-related
batch jobs are typically of lower priority, the window for data
loading is not very long. A data warehousing system that has its ETL process finishing
right on time is going to have a lot of problems, simply because the jobs often do not
get started on time due to factors that are beyond the control of the data warehousing
team. As a result, it is always an excellent idea for the data warehousing group to
tune the ETL process as much as possible.
Query Processing - Sometimes, especially in a ROLAP environment or in a
system where the reports are run directly against the relational database, query
performance can be an issue. A study has shown that users typically lose interest
after 30 seconds of waiting for a report to return. My experience has been that ROLAP
reports or reports that run directly against the RDBMS often exceed this time limit,
and it is hence ideal for the data warehousing team to invest some time to tune the
queries, especially the most popular ones.
Report Delivery - It is also possible that end users are experiencing significant
delays in receiving their reports due to factors other than the query performance. For
example, network traffic, server setup, and even the way that the front-end was built
sometimes play significant roles. It is important for the data warehouse team to look
into these areas for performance tuning.
QA
Once the development team declares that everything is ready for further testing, the QA
team takes over. The QA team is always from the client. Usually the QA team members will
know little about data warehousing, and some of them may even resent the need to have to
learn another tool or tools. This makes the QA process a tricky one.
Sometimes the QA process is overlooked. On my very first data warehousing project, the
project team worked very hard to get everything ready for Phase 1, and everyone thought
that we had met the deadline. There was one mistake, though: the project managers failed
to recognize that it is necessary to go through the client QA process before the project can
go into production. As a result, it took five extra months to bring the project to production
(the original development time had been only 2 1/2 months).
In the OLAP world, there are mainly two different types: Multidimensional OLAP (MOLAP) and
Relational OLAP (ROLAP). Hybrid OLAP (HOLAP) refers to technologies that combine MOLAP
and ROLAP.
MOLAP
This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a
multidimensional cube. The storage is not in the relational database, but in proprietary
formats.
Advantages:
Excellent performance: MOLAP cubes are built for fast data retrieval, and are
optimal for slicing and dicing operations.
Can perform complex calculations: All calculations have been pre-generated
when the cube is created. Hence, complex calculations are not only doable, but they
return quickly.
Disadvantages:
Limited in the amount of data it can handle: Because all calculations are
performed when the cube is built, it is not possible to include a large amount of data
in the cube itself. This is not to say that the data in the cube cannot be derived from
a large amount of data. Indeed, this is possible. But in this case, only summary-level
information will be included in the cube itself.
Requires additional investment: Cube technology is often proprietary and does
not already exist in the organization. Therefore, to adopt MOLAP technology, chances
are that additional investments in human and capital resources are needed.
ROLAP
This methodology relies on manipulating the data stored in the relational database to give
the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action
of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement.
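For example, under that model a hedged sketch with hypothetical tables looks like this: the base report is one SQL query, and slicing on a year simply adds a WHERE clause.

    -- Base report: sales by region.
    SELECT s.region_name, SUM(f.sales_amount) AS sales
    FROM   fact_sales f
    JOIN   dim_store s ON s.store_key = f.store_key
    GROUP BY s.region_name;

    -- The same report "sliced" to one year: the ROLAP engine adds a WHERE clause.
    SELECT s.region_name, SUM(f.sales_amount) AS sales
    FROM   fact_sales f
    JOIN   dim_store s ON s.store_key = f.store_key
    JOIN   dim_date  d ON d.date_key  = f.date_key
    WHERE  d.calendar_year = 2023       -- the slice chosen by the user
    GROUP BY s.region_name;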
Advantages:
Can handle large amounts of data: The data size limitation of ROLAP
technology is the limitation on data size of the underlying relational database. In
other words, ROLAP itself places no limitation on data amount.
Can leverage functionalities inherent in the relational database: Often,
relational databases already come with a host of functionalities. ROLAP technologies,
since they sit on top of the relational database, can therefore leverage these
functionalities.
Disadvantages:
Performance can be slow: Because each ROLAP report is essentially a SQL
query (or multiple SQL queries) in the relational database, the query time can be long
if the underlying data size is large.
OLAP (online analytical processing) is a function of business intelligence software that enables a
user to easily and selectively extract and view data from different points of view. Designed for
managers looking to make sense of their information, OLAP tools structure data hierarchically,
the way managers think of their enterprises, but also allow business analysts to rotate that data,
changing the relationships to get more detailed insight into corporate information.
WebFOCUS OLAP combines all the functionality of query tools, reporting tools, and OLAP into
a single powerful solution with one common interface so business analysts can slice and dice the
data and see business processes in a new way. WebFOCUS makes data part of an organization's
natural culture by giving developers the premier design environments for automated ad hoc and
parameter-driven reporting and giving everyone else the ability to receive and retrieve data in
any format, performing analysis using whatever device or application is part of their daily working
life.
WebFOCUS ad hoc reporting and OLAP features allow users to slice and dice data in an almost
unlimited number of ways. Satisfying the broadest range of analytical needs, business
intelligence application developers can easily enhance reports with extensive data-analysis
functionality so that end users can dynamically interact with the information. WebFOCUS also
supports the real-time creation of Excel spreadsheets and Excel PivotTables with full styling,
drill-downs, and formula capabilities so that Excel power users can analyze their corporate data
in a tool with which they are already familiar.
Business intelligence (BI) tools empower organizations to facilitate improved
business decisions. BI tools enable users throughout the extended enterprise not
only to access company information but also to report on and analyze that critical data
in an efficient and intuitive manner. It is not just about delivering reports from a
data warehouse; it is about providing large numbers of people (executives,
analysts, customers, partners, and everyone else) secure and simple access to the
right information so they can make better decisions. The best BI tools allow
employees to enhance their productivity while maintaining a high degree of self-sufficiency.