UNIT-V Data Warehousing, Data Mining & OLAP

UNIT-V
Data Warehousing. Data Warehousing Components. Building a Data Warehouse. Mapping the Data Warehouse to a Multiprocessor Architecture. DBMS Schemas for Decision Support. Data Extraction, Cleanup & Transformation Tools. Metadata. Data Mining: Introduction to Data Mining
Kapil Tomar, IT Deptt. AKGEC
What is Data Warehousing

Data Warehousing is an architectural construct of information systems that provides users with current and historical decision support information that is hard to access or present in traditional operational data stores.
Data Warehouse definition

A formal definition of the data warehouse is offered by W.H. Inmon: A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management decisions.
Seven data warehouse components

Data sourcing, cleanup, transformation, and migration tools
Metadata repository
Warehouse/database technology
Data marts
Data query, reporting, analysis, and mining tools
Data warehouse administration and management
Information delivery system
Typically, the source data for the warehouse comes from the operational applications [an exception might be an operational data store (ODS)]. As the data enters the data warehouse, it is transformed into an integrated structure and format. The transformation process may involve conversion, summarization, filtering, and condensation of data. Because data within the data warehouse contains a large historical component (sometimes covering 5 to 10 years), the data warehouse must be capable of holding and managing large volumes of data as well as different data structures for the same database over time.
Sourcing, Acquisition, Cleanup, and Transformation Tools

A significant portion of the data warehouse implementation effort is spent extracting data from operational systems and putting it in a format suitable for informational applications that will run off the data warehouse. The data sourcing, cleanup, transformation, and migration tools perform all of the conversions, summarizations, key changes, structural changes, and condensations needed to transform disparate data into information that can be used by the decision support tool.
The functionality includes

Removing unwanted data from operational databases
Converting to common data names and definitions
Calculating summaries and derived data
Establishing defaults for missing data
Accommodating source data definition changes
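The functions listed above can be illustrated with a minimal sketch. The field names, the name mapping, and the default values below are hypothetical and chosen only for illustration; real transformation tools are driven by metadata rather than hard-coded dictionaries.

```python
# Hypothetical rules, for illustration only.
COMMON_NAMES = {"cust_nm": "customer_name", "CustName": "customer_name",
                "amt": "amount", "Amount": "amount"}
DEFAULTS = {"region": "UNKNOWN"}
UNWANTED = {"session_id"}  # operational-only field, not needed for analysis


def clean_record(raw):
    """Apply the cleanup steps to one operational record (a dict)."""
    record = {}
    for name, value in raw.items():
        if name in UNWANTED:
            continue  # remove unwanted operational data
        record[COMMON_NAMES.get(name, name)] = value  # common data names
    for name, default in DEFAULTS.items():
        if record.get(name) is None:
            record[name] = default  # establish a default for missing data
    return record


def summarize(records):
    """Calculate a derived summary: total amount per region."""
    totals = {}
    for r in records:
        totals[r["region"]] = totals.get(r["region"], 0) + r["amount"]
    return totals


# Two source systems that name the same attributes differently.
source_a = {"cust_nm": "Acme", "amt": 100, "region": "North", "session_id": 7}
source_b = {"CustName": "Bolt", "Amount": 50, "region": None}
cleaned = [clean_record(source_a), clean_record(source_b)]
print(cleaned)
print(summarize(cleaned))
```

Accommodating a source definition change would, in this sketch, amount to adding a new entry to COMMON_NAMES without touching the cleanup logic.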
The data sourcing, cleanup, extract, transformation, and migration tools have to deal with some significant issues as follows:
Database heterogeneity. DBMSs are very different in data models, data access language, data navigation, operations, concurrency, integrity, recovery, etc.
Data heterogeneity. This is the difference in the way data is defined and used in different models: homonyms, synonyms, unit incompatibility, different attributes for the same entity, and different ways of modeling the same fact.
Metadata
Metadata is data about data that describes the data warehouse. It is used for building, maintaining, managing, and using the data warehouse. Metadata can be classified into technical metadata and business metadata.
Technical metadata contains information about warehouse data for use by warehouse designers and administrators when carrying out warehouse development and management tasks. Technical metadata documents include:
Information about data sources
Transformation descriptions, i.e., the mapping method from operational databases into the warehouse, and the algorithms used to convert, enhance, or transform data
Warehouse object and data structure definitions for data targets
The rules used to perform data cleanup and data enhancement
Data mapping operations when capturing data from source systems and applying it to the target warehouse database
Access authorization, backup history, archive history, information delivery history, data acquisition history, data access, etc.
Business metadata contains information that gives users an easy-to-understand perspective of the information stored in the data warehouse. Business metadata documents information about:
Subject areas and information object types, including queries, reports, images, video, and/or audio clips
Internet home pages
Other information to support all data warehousing components. For example, the information related to the information delivery system (see Sec. 6.8) should include subscription information, scheduling information, details of delivery destinations, and the business query objects such as predefined queries, reports, and analyses
Data warehouse operational information, e.g., data history (snapshots, versions), ownership, extract audit trail, usage data
Metadata repository management software can be used to map the source data to the target database, generate code for data transformations, integrate and transform the data, and control moving data to the warehouse. One of the important functional components of the metadata repository is the information directory.
From a technical requirements point of view, the information directory and the entire metadata repository:
Should be a gateway to the data warehouse environment, and thus should be accessible from any platform via transparent and seamless connections
Should support an easy distribution and replication of its content for high performance and availability
Should be searchable by business-oriented key words
Should act as a launch platform for end-user data access and analysis tools
Should support the sharing of information objects such as queries, reports, data collections, and subscriptions between users
Should support a variety of scheduling options for requests against the data warehouse, including on-demand, one-time, repetitive, event-driven, and conditional delivery (in conjunction with the information delivery system)
Should support the distribution of the query results to one or more destinations in any of the user-specified formats (in conjunction with the information delivery system)
Should support and provide interfaces to other applications such as email, spreadsheets, and schedulers
Access Tools
The principal purpose of data warehousing is to provide information to business users for strategic decision making. These users interact with the data warehouse using front-end tools.
For the purpose of this discussion let's divide these tools into five main groups:
Data query and reporting tools
Application development tools
Executive information system (EIS) tools
On-line analytical processing tools
Data mining tools
Data query and reporting tools

This category can be further divided into two groups: reporting tools and managed query tools.
1. Reporting tools can be divided into production reporting tools and desktop report writers.
Production reporting tools let companies generate regular operational reports or support high-volume batch jobs, such as calculating and printing paychecks. Report writers, on the other hand, are inexpensive desktop tools designed for end users.
2. Managed query tools shield end users from the complexities of SQL and database structures by inserting a metalayer between users and the database. The metalayer is the software that provides subject-oriented views of a database and supports point-and-click creation of SQL.
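A minimal sketch of the metalayer idea follows. The table name, column names, and business terms are hypothetical; the point is only that the user picks business-oriented terms and the metalayer produces the SQL.

```python
# Hypothetical metalayer: business terms mapped to physical columns.
METALAYER = {
    "table": "sales_fact",
    "fields": {
        "Region": "mkt_region_cd",
        "Product": "prod_nm",
        "Units Sold": "SUM(units_qty)",
    },
}


def build_query(dimensions, measure):
    """Translate point-and-click selections into a SQL statement."""
    fields = METALAYER["fields"]
    dim_cols = [fields[d] for d in dimensions]
    select_list = ", ".join(dim_cols + [fields[measure]])
    return (f"SELECT {select_list} FROM {METALAYER['table']} "
            f"GROUP BY {', '.join(dim_cols)}")


# The user never sees column names such as mkt_region_cd.
print(build_query(["Region", "Product"], "Units Sold"))
```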
Application development tools

Tools for in-house application development include PowerBuilder from PowerSoft, Visual Basic from Microsoft, Forte from Forte Software, and Business Objects from Business Objects.
On-line analytical processing tools

On-line analytical processing (OLAP) tools are based on the concepts of multidimensional databases and allow a sophisticated user to analyze the data using elaborate, multidimensional views. Typical business applications for these tools include product performance and profitability, effectiveness of a sales program or a marketing campaign, sales forecasting, and capacity planning.
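As a rough illustration of a multidimensional view, the sketch below stores hypothetical sales figures in cells keyed by three dimensions and then rolls up or slices them. The dimension names and data are assumptions made for the example, not part of any particular OLAP product.

```python
from collections import defaultdict

# Hypothetical cube cells: (product, market, period) -> units sold.
DIMENSIONS = ("product", "market", "period")
cube = {
    ("Widget", "East", "Q1"): 10,
    ("Widget", "West", "Q1"): 5,
    ("Gadget", "East", "Q1"): 7,
    ("Widget", "East", "Q2"): 12,
}


def roll_up(cells, keep):
    """Aggregate units over every dimension not listed in keep."""
    positions = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(int)
    for key, units in cells.items():
        totals[tuple(key[p] for p in positions)] += units
    return dict(totals)


def slice_cube(cells, dimension, value):
    """Keep only the cells where one dimension has a fixed value."""
    pos = DIMENSIONS.index(dimension)
    return {k: v for k, v in cells.items() if k[pos] == value}


print(roll_up(cube, ["product"]))            # units per product
print(roll_up(cube, ["market", "period"]))   # units per market and period
print(slice_cube(cube, "period", "Q2"))      # only the Q2 cells
```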
Data Mining
A critical success factor for any business today is its ability to use information effectively. Knowing this information, an organization can formulate effective business, marketing, and sales strategies; precisely target promotional activity; discover and penetrate new markets; and successfully compete in the marketplace from a position of informed strength.
A relatively new and promising technology aimed at achieving this strategic advantage is known as data mining.
A major attraction of data mining is its ability to build predictive rather than retrospective models.
Most organizations engage in data mining to:
Visualize data
Correct data
Discover knowledge. The goal of knowledge discovery is to determine explicit hidden relationships, patterns, or correlations from data stored in an enterprise's database.
Specifically, data mining can be used to perform:
Segmentation (e.g., grouping customer records for custom-tailored marketing)
Classification (assignment of input data to a predefined class, discovery and understanding of trends, text document classification)
Association (discovery of cross-sales opportunities)
Preferencing (determining the preferences of the majority of customers)
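To make the association task concrete, here is a small sketch that counts how often pairs of items are bought together and computes the confidence of a simple rule. The baskets are invented data, and this pair-counting approach is a simplified stand-in for real association-rule algorithms.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "milk"},
    {"butter", "jam"},
]


def pair_support(transactions):
    """Count how many baskets contain each pair of items."""
    counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


def confidence(transactions, antecedent, consequent):
    """Fraction of baskets with the antecedent that also hold the consequent."""
    with_antecedent = [b for b in transactions if antecedent in b]
    if not with_antecedent:
        return 0.0
    both = sum(1 for b in with_antecedent if consequent in b)
    return both / len(with_antecedent)


print(pair_support(baskets).most_common(2))
print(confidence(baskets, "jam", "butter"))
```

A pair with high support and high confidence, such as jam and butter here, is the kind of pattern that suggests a cross-sales opportunity.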
Data Marts
The term data mart means different things to different people. A rigorous definition of this term is a data store that is subsidiary to a data warehouse of integrated data. The data mart is directed at a partition of data (often called a subject area) that is created for the use of a dedicated group of users. A data mart might, in fact, be a set of denormalized, summarized, or aggregated data. Sometimes, such a set could be placed on the data warehouse database rather than in a physically separate store of data. In most instances, however, the data mart is a physically separate store of data and is normally resident on a separate database server, often on the local area network serving a dedicated user group.
A data mart is often a necessary and valid solution to a pressing business problem, thus achieving the goal of rapid delivery of enhanced decision support functionality to end users. The business drivers underlying such developments include:
Extremely urgent user requirements
The absence of a budget for a full data warehouse strategy
The absence of a sponsor for an enterprise-wide decision support strategy
The decentralization of business units
The attraction of easy-to-use tools and a mind-sized project
In summary, data marts present two problems: (1) scalability in situations where an initial small data mart grows quickly in multiple dimensions and (2) data integration. Therefore, when designing data marts, organizations should pay close attention to system scalability, data consistency, and manageability issues. The key to a successful data mart strategy is the development of an overall scalable data warehouse architecture, and the key step in that architecture is identifying and implementing the common dimensions.
Data Warehouse Administration and Management

Security and priority management
Monitoring updates from multiple sources
Data quality checks
Managing and updating metadata
Auditing and reporting data warehouse usage and status (for managing the response time and resource utilization, and providing chargeback information)
Purging data
Replicating, subsetting, and distributing data
Backup and recovery
Data warehouse storage management [e.g., capacity planning, hierarchical storage management (HSM), purging of aged data]
Information Delivery System

The information delivery component is used to enable the process of subscribing for data warehouse information and having it delivered to one or more destinations of choice according to some user-specified scheduling algorithm. In other words, the information delivery system distributes warehouse-stored data and other information objects to other data warehouses and end-user products such as spreadsheets and local databases.

Delivery of information may be based on time of day or on the completion of an external event. The value of data warehousing is maximized when the right information gets into the hands of those individuals who need it, where they need it, and when they need it the most.
Building a Data Warehouse
Nine-Step Method in the Design of a Data Warehouse

1. Choosing the subject matter
2. Deciding what a fact table represents
3. Identifying and conforming the dimensions
4. Choosing the facts
5. Storing precalculations in the fact table
6. Rounding out the dimension tables
7. Choosing the duration of the database
8. The need to track slowly changing dimensions
9. Deciding the query priorities and the query modes
Benefits of Data Warehousing

Locating the right information
Presentation of information (reports, graphs)
Testing of hypotheses
Discovery of information
Sharing the analysis
Tangible benefits
Product inventory turnover is improved.
Costs of product introduction are decreased with improved selection of target markets.
More cost-effective decision making is enabled by separating (ad hoc) query processing from running against operational databases.
Better business intelligence is enabled by increased quality and flexibility of market analysis available through multilevel data structures, which may range from detailed to highly summarized. For example, determining the effectiveness of marketing programs allows the elimination of weaker programs and enhancement of stronger ones.
Enhanced asset and liability management means that a data warehouse can provide a "big picture" of enterprise-wide purchasing and inventory patterns, and can indicate otherwise unseen credit exposure and opportunities for cost savings.
Intangible benefits
Improved productivity, by keeping all required data in a single location and eliminating the rekeying of data
Reduced redundant processing, support, and software to support overlapping decision support applications
Enhanced customer relations through improved knowledge of individual requirements and trends, through customization, improved communications, and tailored product offerings
Enabling business process reengineering: data warehousing can provide useful insights into the work processes themselves, resulting in breakthrough ideas for the reengineering of those processes
Mapping the Data Warehouse to a Multiprocessor Architecture

Organizations that have embarked on data warehousing development deal with ever-increasing amounts of data. Generally speaking, the size of a data warehouse rapidly approaches the point where the search for better performance and scalability becomes a real necessity. This search pursues two goals:
Speed-up: the ability to execute the same request on the same amount of data in less time
Scale-up: the ability to obtain the same performance on the same request as the database size increases
An additional and important goal is to achieve linear speed-up and scale-up: doubling the number of processors cuts the response time in half (linear speed-up) or provides the same performance on twice as much data (linear scale-up). These goals of linear performance and scalability can be satisfied by parallel hardware architectures, parallel operating systems, and parallel database management systems. Parallel hardware architectures are based on multiprocessor systems designed as a shared-memory model [symmetric multiprocessors (SMPs)], shared-disk model, or distributed-memory model [massively parallel processors (MPPs) and clusters of uniprocessors and/or SMPs].
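The two measures can be expressed as simple ratios. The sketch below uses the commonly used definitions (elapsed time on the smaller system divided by elapsed time on the larger one); the function names, timings, and the 5% tolerance are assumptions made for illustration.

```python
def speed_up(time_one_processor, time_n_processors):
    """Same request, same data: how many times faster the larger system is."""
    return time_one_processor / time_n_processors


def scale_up(time_small, time_large):
    """Original system and data versus a larger system with proportionally
    more data. A value of 1.0 means performance was preserved."""
    return time_small / time_large


def is_linear_speed_up(t1, tn, n, tolerance=0.05):
    """True when n processors give (close to) an n-fold speed-up."""
    return abs(speed_up(t1, tn) - n) <= tolerance * n


# Hypothetical timings in seconds.
print(speed_up(120.0, 60.0))               # 2 processors, half the time
print(is_linear_speed_up(120.0, 60.0, 2))  # linear
print(is_linear_speed_up(120.0, 80.0, 2))  # sub-linear
print(scale_up(120.0, 120.0))              # twice the data on twice the processors
```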
Types of parallelism
Horizontal parallelism
Vertical parallelism
Horizontal parallelism means that the database is partitioned across multiple disks, and parallel processing occurs within a specific task (e.g., table scan) that is performed concurrently on different processors against different sets of data.
Vertical parallelism occurs among different tasks: all component query operations (e.g., scan, join, sort) are executed in parallel in a pipelined fashion. In other words, an output from one task (e.g., scan) becomes an input into another task (e.g., join) as soon as records become available.
A truly parallel DBMS should support both horizontal and vertical types of parallelism concurrently (see Fig. 8.1, case 4).
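The two kinds of parallelism can be sketched in a few lines. The data, the worker pool, and the function names are hypothetical. Note that the generator pipeline only models the record-at-a-time data flow of vertical parallelism; in a real parallel DBMS each pipeline stage would run on its own processor.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical table of (product_id, units), partitioned across three "disks".
partitions = [
    [("p1", 3), ("p2", 8)],
    [("p3", 1), ("p1", 6)],
    [("p2", 4)],
]
PRODUCT_NAMES = {"p1": "Widget", "p2": "Gadget", "p3": "Gizmo"}


def scan_partition(rows, min_units):
    """One scan task working on its own partition."""
    return [row for row in rows if row[1] >= min_units]


def horizontal_scan(parts, min_units):
    """Horizontal parallelism: the same scan runs concurrently per partition."""
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        results = list(pool.map(lambda rows: scan_partition(rows, min_units), parts))
    return [row for part in results for row in part]


def scan(rows):
    """Pipeline stage 1: emit rows one at a time."""
    for row in rows:
        yield row


def join(rows, names):
    """Pipeline stage 2: consume each row as soon as the scan emits it."""
    for product_id, units in rows:
        yield (names[product_id], units)


def pipelined_query(parts):
    """Vertical parallelism (data flow only): scan output feeds the join."""
    all_rows = (row for part in parts for row in part)
    return list(join(scan(all_rows), PRODUCT_NAMES))


print(horizontal_scan(partitions, 4))
print(pipelined_query(partitions))
```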
Data partitioning
Hash partitioning
Key range partitioning
Schema partitioning
User-defined partitioning
Data partitioning
Hash partitioning. A hash algorithm is used to calculate the partition number (hash value) based on the value of the partitioning key for each row.
Key range partitioning. Rows are placed and located in the partitions according to the value of the partitioning key (all rows with the key value from A to K are in partition 1, L to T are in partition 2, etc.).
Schema partitioning. This is an option not to partition a table across disks; instead, an entire table is placed on one disk, another table is placed on a different disk, etc. This is useful for small reference tables that are more effectively used when replicated in each partition rather than spread across partitions.
User-defined partitioning. This is a partitioning method that allows a table to be partitioned on the basis of a user-defined expression (e.g., use state codes to place rows in one of 50 partitions).
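A minimal sketch of three of these methods follows. The choice of CRC32 as the hash function, the third key range (U to Z), and the short list of state codes are assumptions made for the example; a real DBMS uses its own hash function and partition catalog.

```python
import zlib


def hash_partition(key, num_partitions):
    """Hash partitioning: a hash of the key picks the partition (0-based)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


# Key ranges from the text (A-K, L-T); the U-Z range is an assumed continuation.
KEY_RANGES = [("A", "K"), ("L", "T"), ("U", "Z")]


def key_range_partition(key):
    """Key range partitioning: the first letter of the key picks the partition."""
    first = key[0].upper()
    for number, (low, high) in enumerate(KEY_RANGES, start=1):
        if low <= first <= high:
            return number
    raise ValueError(f"no partition covers key {key!r}")


# Hypothetical subset of state codes; the text mentions 50 partitions.
STATE_CODES = ["CA", "NY", "TX"]


def user_defined_partition(row, expression):
    """User-defined partitioning: any expression over the row picks the partition."""
    return expression(row)


print(hash_partition("customer-42", 4))
print(key_range_partition("Baker"), key_range_partition("Miller"))
print(user_defined_partition({"state": "NY"}, lambda r: STATE_CODES.index(r["state"]) + 1))
```

Hash partitioning spreads rows evenly but scatters neighboring keys, while key range partitioning keeps neighboring keys together at the risk of uneven partition sizes.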
Database Architectures for Parallel Processing

Shared-memory architecture
Shared-disk architecture
Shared-nothing architecture
Shared-memory architecture
Shared-disk architecture
Shared-nothing architecture
Parallel DBMS Features

Scope and techniques of parallel DBMS operations
Optimizer implementation
Application transparency
The parallel environment
DBMS management tools
Price/performance
DBMS Schemas for Decision Support

Data warehousing projects were forced to choose between a data model and a corresponding database schema that is intuitive for analysis but performs poorly and a model-schema that performs better but is not well suited for analysis. The schema methodology that is gaining widespread acceptance for data warehousing is the star schema.
Indeed, solving modern business problems such as market analysis and financial forecasting requires query-centric database schemas that are array oriented and multidimensional in nature. These business problems are characterized by the need to retrieve large numbers of records from very large data sets (hundreds of gigabytes and even terabytes) and summarize them on the fly.
DBMS Schemas for Decision Support

Star Schema
Potential performance problems with star schemas
Star Schema
The multidimensional view of data that is expressed using relational database semantics is provided by the database schema design called star schema. The basic premise of star schemas is that information can be classified into two groups: facts and dimensions. Facts are the core data elements being analyzed; for example, units of individual items sold are facts. Dimensions are attributes about the facts; for example, the product types purchased and the date of purchase are dimensions (see Fig. 9.1).
In this example, facts (UNITS) are analyzed through a set of dimensions (MARKETS, PRODUCTS, PERIOD). It is important to notice that, in the typical star schema, the fact table is much larger than any of its dimension tables. This point becomes an important consideration in the performance issues associated with star schemas.
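A small star schema along these lines can be sketched with Python's built-in sqlite3 module. The table names UNITS, MARKETS, PRODUCTS, and PERIOD follow the text; the column names and sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: small, descriptive attributes (columns are hypothetical).
CREATE TABLE MARKETS  (market_id  INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE PRODUCTS (product_id INTEGER PRIMARY KEY, product_type TEXT);
CREATE TABLE PERIOD   (period_id  INTEGER PRIMARY KEY, sale_date TEXT);

-- Fact table: one row per market/product/period with a compound key.
CREATE TABLE UNITS (
    market_id  INTEGER REFERENCES MARKETS(market_id),
    product_id INTEGER REFERENCES PRODUCTS(product_id),
    period_id  INTEGER REFERENCES PERIOD(period_id),
    units_sold INTEGER,
    PRIMARY KEY (market_id, product_id, period_id)
);

INSERT INTO MARKETS  VALUES (1, 'East'), (2, 'West');
INSERT INTO PRODUCTS VALUES (1, 'Shoes'), (2, 'Hats');
INSERT INTO PERIOD   VALUES (1, '2024-01-15'), (2, '2024-02-15');
INSERT INTO UNITS    VALUES (1, 1, 1, 10), (1, 2, 1, 4), (2, 1, 1, 6), (1, 1, 2, 9);
""")

# Ask a question about the facts through two of the dimensions.
rows = conn.execute("""
    SELECT p.product_type, m.region, SUM(u.units_sold)
    FROM UNITS u
    JOIN PRODUCTS p ON p.product_id = u.product_id
    JOIN MARKETS  m ON m.market_id  = u.market_id
    GROUP BY p.product_type, m.region
    ORDER BY p.product_type, m.region
""").fetchall()
print(rows)
```

Even in this toy example the fact table carries every dimension key in its primary key, which is the source of the index-size concerns that arise with compound keys in star schemas.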
Potential performance problems with star schemas

Indexing. Indexes can enforce the uniqueness of the keys, but the multipart (compound) key carried in the fact table creates several problems:
It requires multiple metadata definitions (one for each key component) to define a single relationship (table); this adds to the design complexity and slows performance.
Since the fact table must carry all key components as part of its primary key, adding or deleting levels in the hierarchy requires physical modification of the affected table, a time-consuming process that limits flexibility.
Carrying all the segments of the compound dimensional key in the fact table increases the size of the index, thus impacting both performance and scalability.
Metadata
Metadata is one of the most important aspects of data warehousing. It is data about data stored in the warehouse and its users. At a minimum, metadata contains:
The location and description of warehouse system and data components (warehouse objects).
Names, definition, structure, and content of the data warehouse and end-user views.
Identification of authoritative data sources (systems of record).
Integration and transformation rules used to populate the data warehouse; these include the mapping method from operational databases into the warehouse, and algorithms used to convert, enhance, or transform data.
Integration and transformation rules used to deliver data to end-user analytical tools.
Subscription information for the information delivery to the analysis subscribers.
Data warehouse operational information, which includes a history of warehouse updates, refreshments, snapshots, versions, ownership authorizations, and extract audit trail.
Metrics used to analyze warehouse usage and performance according to end-user usage patterns.
Security authorizations, access control lists, etc.
Metadata Repository
Metadata repository management software can be used to map the source data to the target database, generate code for data transformations, integrate and transform the data, and control moving data to the warehouse. This software, which typically runs on a workstation, enables users to specify how the data should be transformed, such as data mapping, conversion, and summarization. Metadata is searched by users to find data definitions or subject areas. In other words, metadata provides decision-support-oriented pointers to warehouse data, and thus provides a logical link between warehouse data and the decision support application.
Having such a metadata repository implemented as a part of the data warehouse framework provides the following benefits:
It provides a comprehensive suite of tools for enterprise-wide metadata management.
It reduces and eliminates information redundancy, inconsistency, and underutilization.
It simplifies management and improves organization, control, and accounting of information assets.
It increases identification, understanding, coordination, and utilization of enterprise-wide information assets.
It provides effective data administration tools to better manage corporate information assets with a full-function data dictionary.
It increases flexibility, control, and reliability of the application development process and accelerates internal application development.
It leverages investment in legacy systems with the ability to inventory and utilize existing applications.
It provides a universal relational model for heterogeneous RDBMSs to interact and share information.
It enforces CASE development standards and eliminates redundancy with the ability to share and reuse metadata.
Metadata Management
A frequently occurring problem in data warehousing is the inability to communicate to the end user what information resides in the data warehouse and how it can be accessed. The key to providing users and applications with a roadmap to the information stored in the warehouse is the metadata. It can define all data elements and their attributes, data sources and timing, and the rules that govern data use and data transformations. Since metadata describes the information in the warehouse from multiple viewpoints (input, sources, transformation, access, etc.), it can tell users and applications:
What data exists in the data warehouse
Where to find the data
What the original sources of the data are
How summarizations were created
What transformations were used
Who is responsible for correcting errors
What queries can be used to access the data
How business definitions have changed over time
What underlying business assumptions have been made
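As a rough illustration of metadata acting as a roadmap, the sketch below keeps one hypothetical metadata entry and answers a few of these questions from it. The entry's fields, names, and values are invented for the example and do not reflect any particular repository product.

```python
# Hypothetical metadata entry for one warehouse data element.
METADATA = {
    "monthly_sales": {
        "description": "Units sold per product and month",
        "location": "warehouse.sales_summary",
        "sources": ["orders_db.order_lines"],
        "summarization": "SUM(units) grouped by product and month",
        "transformations": ["currency converted to USD"],
        "steward": "sales-data-team",
    }
}


def describe(element):
    """Answer where the data is, where it came from, and who owns it."""
    entry = METADATA.get(element)
    if entry is None:
        return f"No metadata recorded for {element!r}"
    return (f"{element}: stored in {entry['location']}, "
            f"sourced from {', '.join(entry['sources'])}, "
            f"summarized as {entry['summarization']}, "
            f"errors corrected by {entry['steward']}")


def find_by_keyword(keyword):
    """Search descriptions by a business-oriented key word."""
    keyword = keyword.lower()
    return [name for name, entry in METADATA.items()
            if keyword in entry["description"].lower()]


print(describe("monthly_sales"))
print(find_by_keyword("product"))
```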
Thank You
