
Data warehousing: A data warehouse is a collection of data gathered and organized so that it can easily be analyzed, extracted, synthesized, and otherwise used for the purpose of further understanding the data. It may be contrasted with data that is gathered to meet immediate business objectives such as order and payment transactions, although such data would also usually become part of a data warehouse.
Data warehouse
A data warehouse is a centrally managed and integrated database containing data from the operational sources in an organization (such as SAP, CRM and ERP systems, files, services, XML). It may also gather manual inputs from users, defining criteria and parameters for grouping or classifying records.
The database contains structured data for query analysis and can be accessed by users. The data warehouse can be created or updated at any time, with minimum disruption to operational systems; this is ensured by the strategy implemented in the ETL process.
A source for the data warehouse is a data extract from operational databases. The data is validated, cleansed, transformed and finally aggregated, and it becomes ready to be loaded into the data warehouse.
A data warehouse is a dedicated database which contains detailed, stable, non-volatile and consistent data which can be analyzed over time (it is time-variant).
Sometimes, where only a portion of the detailed data is required, it may be worth considering using a data mart. A data mart is generated from the data warehouse and contains data focused on a given subject and data that is frequently accessed or summarized.
A scheduled ETL process populates the data marts with subject-specific information from the data warehouse.
Business Intelligence - Data Warehouse - ETL:

So, a data warehouse is:

Ø subject-oriented
Ø integrated
Ø time-variant
Ø non-volatile
The development of data warehousing technologies provides database professionals and decision makers with clean, integrated data that is already transformed and summarized, making the warehouse an appropriate environment for more efficient decision support systems (DSS) and enterprise information systems (EIS) applications, such as data mining or OLAP systems.
Data mining is the search for relationships and global patterns that exist in large databases but are 'hidden' among the vast amount of data, such as a relationship between patient data and medical diagnoses. These relationships represent valuable knowledge about the database and the objects in it and, if the database is a faithful mirror, about the real world registered by the database.
Data warehouses were introduced with the purpose of providing support for decision making, through the use of tools such as On-Line Analytical Processing (OLAP) and data mining. The data stored in a warehouse comes from autonomous, heterogeneous and geographically dispersed information sources. Therefore, the warehousing approach consists of collecting, integrating, and storing in advance all the information needed for decision making, so that this information is available for direct querying, data mining and other analysis. The process of data mining thus aims at transforming the data stored in data warehouse repositories into knowledge.
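As a rough illustration of the kind of direct querying and analysis this enables, the sketch below rolls daily sales facts up to monthly totals per region, a typical OLAP-style aggregation over warehouse data. It assumes the pandas library is available; the sales_df table and its columns are hypothetical examples, not part of the article.

    # Minimal roll-up sketch over illustrative warehouse data (hypothetical names).
    import pandas as pd

    sales_df = pd.DataFrame({
        "sale_date": pd.to_datetime(["2023-01-05", "2023-01-17", "2023-02-03"]),
        "region":    ["North", "North", "South"],
        "amount":    [120.0, 80.0, 200.0],
    })

    # Roll up from individual transactions to month/region totals.
    monthly = (
        sales_df
        .assign(month=sales_df["sale_date"].dt.to_period("M"))
        .groupby(["month", "region"], as_index=False)["amount"]
        .sum()
    )
    print(monthly)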
In data warehouse environments, the relational model can be transformed into the following architectures:
# Star schema
# Snowflake schema
# Constellation schema
Star schema architecture:
Star schema architecture is the simplest data warehouse design. The main feature of a star schema is a table at the center, called the fact table (child table), and the dimension tables (parent tables), which allow browsing of specific categories, summarizing, drill-downs and specifying criteria.
Typically, most of the fact tables in a star schema are in database third normal form, while dimension tables are de-normalized (second normal form).
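A minimal star schema sketch is shown below as SQL DDL run through Python's built-in sqlite3 module: one central fact table of sales referencing two de-normalized dimension tables. All table and column names are illustrative, not taken from the article.

    # Star schema sketch: one fact table, de-normalized dimensions (names are illustrative).
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE dim_date (
        date_key  INTEGER PRIMARY KEY,
        full_date TEXT,
        month     INTEGER,
        year      INTEGER
    );

    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY,
        name        TEXT,
        brand       TEXT,   -- brand kept inline: the dimension is de-normalized
        category    TEXT
    );

    CREATE TABLE fact_sales (
        date_key    INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        quantity    INTEGER,
        amount      REAL
    );
    """)

Browsing, summarizing and drill-downs then reduce to joins between fact_sales and the dimension tables.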
Snowflake Schema architecture:
Snowflake schema architecture is a more complex variation of the star schema design. The main difference is that the dimension tables in a snowflake schema are normalized, so they have a typical relational database design.
Snowflake schemas are generally used when a dimension table becomes very big and when a star schema cannot represent the complexity of the data structure. For example, if a PRODUCT dimension table contains millions of rows, using a snowflake schema should significantly improve performance by moving some data out to another table (with BRANDS, for instance).
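Continuing the same illustrative example, the snowflake variant below normalizes the product dimension by moving the brand attributes out into their own table, roughly as the article suggests for a very large PRODUCT dimension. The names remain hypothetical.

    # Snowflake sketch: the product dimension is normalized into product + brand tables.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE dim_brand (
        brand_key INTEGER PRIMARY KEY,
        name      TEXT
    );

    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY,
        name        TEXT,
        brand_key   INTEGER REFERENCES dim_brand(brand_key)   -- normalized link
    );

    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        quantity    INTEGER,
        amount      REAL
    );
    """)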
Fact constellation schema architecture:
For each star schema or snowflake schema it is possible to construct a fact constellation schema. This schema is more complex than the star or snowflake architecture because it contains multiple fact tables, which allows dimension tables to be shared amongst many fact tables.
That solution is very flexible; however, it may be hard to manage and support.
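A constellation can be sketched by adding a second fact table that reuses the same dimension tables, as in the hypothetical fragment below (again with illustrative names only).

    # Fact constellation sketch: two fact tables sharing the same dimension tables.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE dim_date    (date_key    INTEGER PRIMARY KEY, full_date TEXT);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);

    -- Sales and inventory facts both reference the shared dimensions.
    CREATE TABLE fact_sales (
        date_key    INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        amount      REAL
    );

    CREATE TABLE fact_inventory (
        date_key         INTEGER REFERENCES dim_date(date_key),
        product_key      INTEGER REFERENCES dim_product(product_key),
        quantity_on_hand INTEGER
    );
    """)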
What is ETL?
ETL is a process of gathering, converting and storing data, often from many locations; the data is often converted from one format to another in the process. ETL is an abbreviation for "Extract, Transform and Load". At present the most popular and widely used ETL tools and applications on the market are:
# IBM WebSphere DataStage (formerly known as Ascential DataStage and Ardent DataStage)
# Informatica PowerCenter
# Oracle Warehouse Builder
# Ab Initio
# Pentaho Data Integration - Kettle Project (open source ETL)
# SAS ETL Studio
# Cognos DecisionStream
# Business Objects Data Integrator (BODI)
# Microsoft SQL Server Integration Services (SSIS)
The ETL process itself consists of three broad steps:
* extracting data from source operational or archive systems, which are the primary source of data for the data warehouse
* transforming the data, which may involve cleaning, filtering, validating and applying business rules
* loading the data into a data warehouse or any other database or application that houses data
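As a rough end-to-end illustration of these three steps, the sketch below extracts rows from a CSV export of an operational system, applies a simple cleansing/business rule, and loads the result into a warehouse table. The file, table and column names and the rules are hypothetical, chosen only for the example.

    # Minimal ETL sketch: extract from a CSV export, transform/validate, load into
    # a warehouse table. All names and rules are hypothetical.
    import csv
    import sqlite3

    def extract(path):
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def transform(rows):
        cleaned = []
        for row in rows:
            try:
                order_id = row["order_id"]
                amount = float(row["amount"])
            except (KeyError, ValueError):
                continue                  # cleansing: drop malformed records
            if amount <= 0:
                continue                  # business rule: keep positive sales only
            cleaned.append((order_id, amount))
        return cleaned

    def load(rows, con):
        con.execute("CREATE TABLE IF NOT EXISTS fact_orders (order_id TEXT, amount REAL)")
        con.executemany("INSERT INTO fact_orders VALUES (?, ?)", rows)
        con.commit()

    if __name__ == "__main__":
        con = sqlite3.connect("warehouse.db")          # hypothetical target database
        load(transform(extract("orders_export.csv")), con)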
The ETL process is also very often referred to as a data integration process, and an ETL tool as a data integration platform.
The terms closely related to and managed by ETL processes are: data migration, data management, data cleansing, data synchronization and data consolidation.
The main goal of maintaining an ETL process in an organization is to migrate and transform data from the source OLTP systems in order to feed a data warehouse and form data marts.
What is Business Intelligence?
Business intelligence (BI) is a broad category of application programs and technologies for gathering, storing, analyzing, and providing access to data to help enterprise users make better business decisions. BI applications include the activities of decision support, query and reporting, online analytical processing (OLAP), statistical analysis, forecasting, and data mining. Examples: BusinessObjects.
DataStage was formerly known as Ardent DataStage and later Ascential DataStage; in 2005 it was acquired by IBM and added to the WebSphere family. Starting from 2006 its official name was IBM WebSphere DataStage, and in 2008 it was renamed IBM InfoSphere DataStage.

DataStage is a product from IBM used as the strategic ETL tool within many organizations. It can be used for multiple purposes: interfacing between multiple databases; changing data from one format to another (e.g. from databases to flat files, XML files, etc.); fast access to data that doesn't change often; and interacting with WebSphere MQ to provide real-time processing capabilities triggered by external messages.
Usage of DataStage within organizations:
DataStage has Windows clients which connect to the server on a Unix, Windows or mainframe platform. The clients can be used to develop, deploy and run DataStage jobs. In a deployment environment, the jobs can be kicked off through scripts directly on Unix servers.
The core DataStage client applications are common to all versions of DataStage; those are:
* Administrator - administers DataStage projects, manages global settings and interacts with the system
* Designer - used to create DataStage jobs and job sequences which are compiled into executable programs. It is the main module for developers.
* Director - manages running and monitoring of DataStage jobs. It is mainly used by operators and testers.
* Manager - for managing, browsing and editing the data warehouse metadata repository.
