
UNIT-V

Data Warehousing. Data Warehousing Components. Building a Data Warehouse. Mapping the Data Warehouse to a Multiprocessor Architecture. DBMS Schemas for Decision Support. Data Extraction, Cleanup, and Transformation Tools. Metadata. Data Mining: Introduction to Data Mining.
Kapil Tomar, IT Deptt. AKGEC

What is Data Warehousing


Data Warehousing is an architectural construct of information systems that provides users with current and historical decision support information that is hard to access or present in traditional operational data stores.


Data Warehouse definition


A formal definition of the data warehouse is offered by W.H. Inmon: A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management decisions.


Seven data warehouse components


Data sourcing, cleanup, transformation, and migration tools
Metadata repository
Warehouse/database technology
Data marts
Data query, reporting, analysis, and mining tools
Data warehouse administration and management
Information delivery system


Typically, the source data for the warehouse comes from the operational applications [an exception might be an operational data store (ODS)]. As the data enters the data warehouse, it is transformed into an integrated structure and format. The transformation process may involve conversion, summarization, filtering, and condensation of data. Because data within the data warehouse contains a large historical component (sometimes covering 5 to 10 years), the data warehouse must be capable of holding and managing large volumes of data as well as different data structures for the same database over time.


Sourcing, Acquisition, Cleanup, and Transformation Tools


A significant portion of the data warehouse implementation effort is spent extracting data from operational systems and putting it in a format suitable for informational applications that will run off the data warehouse. These tools perform all of the conversions, summarizations, key changes, structural changes, and condensations needed to transform disparate data into information that can be used by the decision support tool.


The functionality includes


Removing unwanted data from operational databases
Converting to common data names and definitions
Calculating summaries and derived data
Establishing defaults for missing data
Accommodating source data definition changes
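
The functions listed above can be illustrated with a minimal Python sketch. The field names (`cust_nm`, `amt`, `rgn`, `screen_id`), the rename map, and the default value are hypothetical and chosen only for illustration; commercial tools are far more elaborate.

```python
from collections import defaultdict

# Hypothetical mapping from source-specific names to common warehouse names.
COMMON_NAMES = {"cust_nm": "customer_name", "amt": "amount", "rgn": "region"}
DEFAULTS = {"region": "UNKNOWN"}   # defaults for missing data
UNWANTED = {"screen_id"}           # operational-only fields to remove


def clean(record):
    """Drop unwanted fields, convert to common names, and fill defaults."""
    out = {}
    for key, value in record.items():
        if key in UNWANTED:
            continue
        # Unknown fields pass through unchanged, so a new source column
        # does not break the load.
        out[COMMON_NAMES.get(key, key)] = value
    for field, default in DEFAULTS.items():
        if out.get(field) is None:
            out[field] = default
    return out


def summarize(records):
    """Calculate a derived summary: total amount per region."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["region"]] += rec["amount"]
    return dict(totals)


source_rows = [
    {"cust_nm": "Ann", "amt": 10.0, "rgn": "EAST", "screen_id": 7},
    {"cust_nm": "Bob", "amt": 5.0, "rgn": None, "screen_id": 9},
    {"cust_nm": "Cy", "amt": 2.5, "rgn": "EAST", "screen_id": 3},
]
cleaned = [clean(r) for r in source_rows]
summary = summarize(cleaned)
```

Here the row with a missing region is assigned the default before the summary is calculated, so no row is lost from the totals.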


The data sourcing, cleanup, extract, transformation, and migration tools have to deal with some significant issues as follows:

Database heterogeneity. DBMSs are very different in data models, data access language, data navigation, operations, concurrency, integrity, recovery, etc.
Data heterogeneity. This is the difference in the way data is defined and used in different models: homonyms, synonyms, unit incompatibility, different attributes for the same entity, and different ways of modeling the same fact.
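
As an illustration of data heterogeneity, the sketch below reconciles two sources that use synonyms for the same attribute and incompatible units for the same fact. The source names (`billing`, `shipping`), field names, and rule table are hypothetical; only the pound-to-kilogram factor is a fixed constant.

```python
# Hypothetical per-source rules: field synonyms and the factor that converts
# each source's weight unit to kilograms.
SOURCE_RULES = {
    "billing": {
        "fields": {"client": "customer", "wt": "weight_kg"},
        "weight_factor": 1.0,            # already in kilograms
    },
    "shipping": {
        "fields": {"cust": "customer", "weight_lb": "weight_kg"},
        "weight_factor": 0.45359237,     # pounds to kilograms
    },
}


def to_common(source, record):
    """Rename synonyms to one common name and convert units to kilograms."""
    rules = SOURCE_RULES[source]
    out = {}
    for key, value in record.items():
        name = rules["fields"].get(key, key)
        if name == "weight_kg":
            value = round(value * rules["weight_factor"], 3)
        out[name] = value
    return out


from_billing = to_common("billing", {"client": "Acme", "wt": 10.0})
from_shipping = to_common("shipping", {"cust": "Acme", "weight_lb": 10.0})
```

After conversion both records share the same attribute names and unit, so they can be integrated into one warehouse structure.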


Metadata
Metadata is data about data that describes the data warehouse. It is used for building, maintaining, managing, and using the data warehouse. Metadata can be classified into technical metadata and business metadata.


Technical metadata contains information about warehouse data for use by warehouse designers and administrators when carrying out warehouse development and management tasks. Technical metadata documents include:
Information about data sources
Transformation descriptions, i.e., the mapping method from operational databases into the warehouse, and algorithms used to convert, enhance, or transform data
Warehouse object and data structure definitions for data targets
The rules used to perform data cleanup and data enhancement
Data mapping operations when capturing data from source systems and applying it to the target warehouse database
Access authorization, backup history, archive history, information delivery history, data acquisition history, data access, etc.


Business metadata contains information that gives users an easy-to-understand perspective of the information stored in the data warehouse. Business metadata documents information about:
Subject areas and information object types, including queries, reports, images, video, and/or audio clips
Internet home pages
Other information to support all data warehousing components. For example, the information related to the information delivery system (see Sec. 6.8) should include subscription information, scheduling information, details of delivery destinations, and the business query objects such as predefined queries, reports, and analyses.
Data warehouse operational information, e.g., data history (snapshots, versions), ownership, extract audit trail, usage data

Metadata repository management software can be used to map the source data to the target database, generate code for data transformations, integrate and transform the data, and control moving data to the warehouse. One of the important functional components of the metadata repository is the information directory.


From a technical requirements point of view, the information directory and the entire metadata repository:
Should be a gateway to the data warehouse environment, and thus should be accessible from any platform via transparent and seamless connections
Should support an easy distribution and replication of its content for high performance and availability
Should be searchable by business-oriented key words
Should act as a launch platform for end-user data access and analysis tools
Should support the sharing of information objects such as queries, reports, data collections, and subscriptions between users
Should support a variety of scheduling options for requests against the data warehouse, including on-demand, one-time, repetitive, event-driven, and conditional delivery (in conjunction with the information delivery system)
Should support the distribution of the query results to one or more destinations in any of the user-specified formats (in conjunction with the information delivery system)
Should support and provide interfaces to other applications such as email, spreadsheets, and schedulers


Access Tools
The principal purpose of data warehousing is to provide information to business users for strategic decision making. These users interact with the data warehouse using front-end tools.

For the purpose of this discussion let's divide these tools into five main groups:
Data query and reporting tools
Application development tools
Executive information system (EIS) tools
On-line analytical processing tools
Data mining tools


Data query and reporting tools


This category can be further divided into two groups:
1. Reporting tools
2. Managed query tools

1. Reporting tools can be divided into:
i. Production reporting tools
ii. Desktop report writers


Production reporting tools let companies generate regular operational reports or support high-volume batch jobs, such as calculating and printing paychecks. Report writers, on the other hand, are inexpensive desktop tools designed for end users.

2. Managed query tools shield end users from the complexities of SQL and database structures by inserting a metalayer between users and the database. The metalayer is the software that provides subject-oriented views of a database and supports point-and-click creation of SQL.
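
The metalayer idea can be sketched in a few lines of Python. This is only an illustration of the principle, not how any particular product works; the table name `sales_fact`, the column names, and the business terms are hypothetical.

```python
# Hypothetical metalayer: business terms mapped to a physical table and columns.
METALAYER = {
    "table": "sales_fact",
    "columns": {
        "Region": "rgn_cd",
        "Product": "prod_nm",
        "Revenue": "SUM(sale_amt)",
    },
    "measures": {"Revenue"},   # terms that are aggregates, not grouping columns
}


def build_query(selected, layer=METALAYER):
    """Turn a point-and-click selection of business terms into SQL text."""
    cols = layer["columns"]
    select_parts = [f'{cols[term]} AS "{term}"' for term in selected]
    group_by = [cols[term] for term in selected if term not in layer["measures"]]
    sql = f"SELECT {', '.join(select_parts)} FROM {layer['table']}"
    # GROUP BY is needed only when plain columns and aggregates are mixed.
    if group_by and len(group_by) < len(selected):
        sql += f" GROUP BY {', '.join(group_by)}"
    return sql


sql = build_query(["Region", "Revenue"])
```

The end user picks "Region" and "Revenue" and never sees `rgn_cd`, `sale_amt`, or the GROUP BY clause that the metalayer generates.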


Application development tools


Tools used for in-house application development include PowerBuilder from PowerSoft, Visual Basic from Microsoft, Forte from Forte Software, and Business Objects from Business Objects.


On-line analytical processing tools


On-line analytical processing (OLAP) tools are based on the concepts of multidimensional databases and allow a sophisticated user to analyze the data using elaborate, multidimensional views. Typical business applications for these tools include product performance and profitability, effectiveness of a sales program or a marketing campaign, sales forecasting, and capacity planning.
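
A multidimensional view can be illustrated with a tiny in-memory cube. The sketch below is an assumption-laden toy: the dimension names, the cell values, and the helper functions `roll_up` and `slice_cube` are invented for illustration and do not correspond to any OLAP product.

```python
from collections import defaultdict

DIMS = ("product", "region", "quarter")

# Hypothetical cube cells: (product, region, quarter) -> units sold.
cube = {
    ("widget", "EAST", "Q1"): 10,
    ("widget", "WEST", "Q1"): 4,
    ("gadget", "EAST", "Q1"): 7,
    ("widget", "EAST", "Q2"): 12,
}


def roll_up(cells, keep):
    """Aggregate away every dimension that is not listed in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for key, units in cells.items():
        totals[tuple(key[i] for i in idx)] += units
    return dict(totals)


def slice_cube(cells, dim, value):
    """Keep only the cells where dimension `dim` equals `value`."""
    i = DIMS.index(dim)
    return {k: v for k, v in cells.items() if k[i] == value}


by_product = roll_up(cube, ["product"])
east_by_quarter = roll_up(slice_cube(cube, "region", "EAST"), ["quarter"])
```

The same cells answer both "units per product" and "units per quarter in the EAST region", which is the kind of view switching that OLAP tools provide interactively.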


Data Mining
A critical success factor for any business today is its ability to use information effectively. Knowing this information, an organization can formulate effective business, marketing, and sales strategies; precisely target promotional activity; discover and penetrate new markets; and successfully compete in the marketplace from a position of informed strength.

A relatively new and promising technology aimed at achieving this strategic advantage is known as data mining. A major attraction of data mining is its ability to build predictive rather than retrospective models.


Most organizations engage in data mining to:
Visualize data
Correct data
Discover knowledge

The goal of knowledge discovery is to determine explicit hidden relationships, patterns, or correlations from data stored in an enterprise's database. Specifically, data mining can be used to perform:
Segmentation (e.g., grouping customer records for custom-tailored marketing)
Classification (assignment of input data to a predefined class, discovery and understanding of trends, text document classification)
Association (discovery of cross-sales opportunities)
Preferencing (determining the preferences of the majority of customers)
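
Of the tasks above, association is the easiest to show in a few lines. The sketch below counts how often pairs of items are bought together and keeps the pairs whose support (fraction of baskets containing both) reaches a threshold. The baskets and the threshold are made-up illustration data, and this simple pair counting is only a stand-in for the fuller association-rule algorithms used in practice.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets (one set of items per transaction).
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "butter", "cereal"},
]


def frequent_pairs(transactions, min_support):
    """Return item pairs with support >= min_support, mapped to their support."""
    counts = Counter()
    for items in transactions:
        # Sorting makes each pair appear in one canonical order.
        counts.update(combinations(sorted(items), 2))
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}


pairs = frequent_pairs(baskets, min_support=0.5)
```

In this toy data only bread and butter are bought together often enough to suggest a cross-sales opportunity.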


Data Marts
The term data mart means different things to different people. A rigorous definition of this term is a data store that is subsidiary to a data warehouse of integrated data. The data mart is directed at a partition of data (often called a subject area) that is created for the use of a dedicated group of users. A data mart might, in fact, be a set of denormalized, summarized, or aggregated data. Sometimes, such a set could be placed on the data warehouse database rather than in a physically separate store of data. In most instances, however, the data mart is a physically separate store of data and is normally resident on a separate database server, often on the local area network serving a dedicated user group.


The data mart is often a necessary and valid solution to a pressing business problem, thus achieving the goal of rapid delivery of enhanced decision support functionality to end users. The business drivers underlying such developments include:
Extremely urgent user requirements
The absence of a budget for a full data warehouse strategy
The absence of a sponsor for an enterprise-wide decision support strategy
The decentralization of business units
The attraction of easy-to-use tools and a mind-sized project


In summary, data marts present two problems: (1) scalability in situations where an initial small data mart grows quickly in multiple dimensions and (2) data integration. Therefore, when designing data marts, the organizations should pay close attention to system scalability, data consistency, and manageability issues. The key to a successful data mart strategy is the development of an overall scalable data warehouse architecture; and the key step in that architecture is identifying and implementing the common dimensions.


Data Warehouse Administration and Management


Security and priority management
Monitoring updates from multiple sources
Data quality checks
Managing and updating metadata
Auditing and reporting data warehouse usage and status (for managing the response time and resource utilization, and providing chargeback information)
Purging data
Replicating, subsetting, and distributing data
Backup and recovery
Data warehouse storage management [e.g., capacity planning, hierarchical storage management (HSM), purging of aged data]


Information Delivery System


The information delivery component is used to enable the process of subscribing for data warehouse information and having it delivered to one or more destinations of choice according to some user-specified scheduling algorithm.


In other words, the information delivery system distributes warehouse-stored data and other information objects to other data warehouses and end-user products such as spreadsheets and local databases.


Delivery of information may be based on time of day or on the completion of an external event. The value of data warehousing is maximized when the right information gets into the hands of those individuals who need it, where they need it, and when they need it the most.
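
A minimal sketch of the two delivery triggers (time of day and completion of an external event) might look as follows. The `Subscription` class, the report names, the destination strings, and the event name are all hypothetical and serve only to show the idea of subscriptions being matched against triggers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Subscription:
    report: str
    destinations: tuple
    at_hour: Optional[int] = None    # time-of-day trigger (0-23)
    on_event: Optional[str] = None   # external-event trigger


def due_deliveries(subscriptions, hour=None, event=None):
    """Return (report, destination) pairs triggered by a clock hour or an event."""
    deliveries = []
    for sub in subscriptions:
        by_time = hour is not None and sub.at_hour == hour
        by_event = event is not None and sub.on_event == event
        if by_time or by_event:
            deliveries.extend((sub.report, dest) for dest in sub.destinations)
    return deliveries


subs = [
    Subscription("daily_sales", ("spreadsheet:ann", "localdb:finance"), at_hour=6),
    Subscription("load_audit", ("email:dba",), on_event="nightly_load_done"),
]
morning = due_deliveries(subs, hour=6)
after_load = due_deliveries(subs, event="nightly_load_done")
```

One subscription can fan out to several destinations, which matches the description of delivery to one or more destinations of choice.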


Building a Data Warehouse


Nine-Step Method in the Design of a Data Warehouse


1. Choosing the subject matter
2. Deciding what a fact table represents
3. Identifying and conforming the dimensions
4. Choosing the facts
5. Storing precalculations in the fact table
6. Rounding out the dimension tables
7. Choosing the duration of the database
8. The need to track slowly changing dimensions
9. Deciding the query priorities and the query modes

Benefits of Data Warehousing


Locating the right information
Presentation of information (reports, graphs)
Testing of hypotheses
Discovery of information
Sharing the analysis


Tangible benefits
Product inventory turnover is improved.
Costs of product introduction are decreased with improved selection of target markets.
More cost-effective decision making is enabled by separating (ad hoc) query processing from running against operational databases.
Better business intelligence is enabled by increased quality and flexibility of market analysis available through multilevel data structures, which may range from detailed to highly summarized. For example, determining the effectiveness of marketing programs allows the elimination of weaker programs and enhancement of stronger ones.
Enhanced asset and liability management means that a data warehouse can provide a "big picture" of enterprise-wide purchasing and inventory patterns, and can indicate otherwise unseen credit exposure and opportunities for cost savings.

Intangible benefits
Improved productivity, by keeping all required data in a single location and eliminating the rekeying of data
Reduced redundant processing, support, and software to support overlapping decision support applications
Enhanced customer relations through improved knowledge of individual requirements and trends, through customization, improved communications, and tailored product offerings
Enabling business process reengineering: data warehousing can provide useful insights into the work processes themselves, resulting in breakthrough ideas for the reengineering of those processes


Mapping the Data Warehouse to a Multiprocessor Architecture


Organizations that have embarked on data warehousing development deal with ever-increasing amounts of data. Generally speaking, the size of a data warehouse rapidly approaches the point where the search for better performance and scalability becomes a real necessity. This search pursues two goals:
Speed-up: the ability to execute the same request on the same amount of data in less time
Scale-up: the ability to obtain the same performance on the same request as the database size increases

An additional and important goal is to achieve linear speed-up and scale-up: doubling the number of processors cuts the response time in half (linear speed-up) or provides the same performance on twice as much data (linear scale-up). These goals of linear performance and scalability can be satisfied by parallel hardware architectures, parallel operating systems, and parallel database management systems. Parallel hardware architectures are based on multiprocessor systems designed as a shared-memory model [symmetric multiprocessors (SMPs)], shared-disk model, or distributed-memory model [massively parallel processors (MPPs) and clusters of uniprocessors and/or SMPs].
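
The two measures can be made concrete with a small Python sketch. The ratio definitions below are the commonly used ones and are an assumption here, since the slides give only the verbal definitions; the timings are made-up illustration values.

```python
def speed_up(time_one_proc, time_n_procs):
    """Same request, same data: elapsed time on 1 processor / time on N."""
    return time_one_proc / time_n_procs


def scale_up(time_small_system, time_large_system):
    """Elapsed time for the base workload on the base system, divided by the
    time for an N-times larger workload on an N-times larger system.
    A value of 1.0 means linear scale-up."""
    return time_small_system / time_large_system


def is_linear_speed_up(time_one_proc, time_n_procs, n, tolerance=0.05):
    """True when the measured speed-up is within `tolerance` of N."""
    return abs(speed_up(time_one_proc, time_n_procs) - n) <= tolerance * n


# Hypothetical measurements: 100 s on one processor, 50 s on two.
s = speed_up(100.0, 50.0)
linear = is_linear_speed_up(100.0, 50.0, n=2)
# Twice the data on twice the processors still takes 100 s.
u = scale_up(100.0, 100.0)
```

A measured time of 60 s on two processors would give a speed-up of about 1.67, which falls short of linear.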


Types of parallelism
Horizontal parallelism
Vertical parallelism


Horizontal parallelism means that the database is partitioned across multiple disks, and parallel processing occurs within a specific task (e.g., table scan) that is performed concurrently on different processors against different sets of data.
Vertical parallelism occurs among different tasks: all component query operations (e.g., scan, join, sort) are executed in parallel in a pipelined fashion. In other words, an output from one task (e.g., scan) becomes an input into another task (e.g., join) as soon as records become available.
A truly parallel DBMS should support both horizontal and vertical types of parallelism concurrently (see Fig. 8.1, case 4).
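
The difference between the two types can be sketched in Python. This is a structural illustration only, with hypothetical data: the thread pool stands in for processors scanning separate partitions, and the generators show the pipelined hand-off of rows between operators, although in this sketch the pipeline runs in a single thread, whereas a parallel DBMS would run each operator concurrently.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical table split into three partitions of (key, value) rows.
partitions = [
    [("a", 3), ("b", 8)],
    [("c", 12), ("d", 1)],
    [("e", 9)],
]
names = {"b": "beta", "c": "gamma", "e": "epsilon"}   # small lookup table


def scan(partition, min_value):
    """One scan task over one partition."""
    return [row for row in partition if row[1] >= min_value]


def parallel_scan(parts, min_value):
    """Horizontal parallelism: the same task runs on every partition at once."""
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        results = list(pool.map(scan, parts, [min_value] * len(parts)))
    return [row for part in results for row in part]


def scan_op(rows, min_value):
    """Pipelined scan: yields each qualifying row as soon as it is found."""
    for row in rows:
        if row[1] >= min_value:
            yield row


def join_op(rows, lookup):
    """Pipelined join: consumes scan output row by row."""
    for key, value in rows:
        if key in lookup:
            yield key, value, lookup[key]


horizontal = parallel_scan(partitions, 8)
all_rows = (row for part in partitions for row in part)
pipelined = list(join_op(scan_op(all_rows, 8), names))
```

In the pipelined version the join starts working on the first qualifying row without waiting for the scan to finish, which is the essence of vertical parallelism.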

Data partitioning
Hash partitioning
Key range partitioning
Schema partitioning
User-defined partitioning

Hash partitioning. A hash algorithm is used to calculate the partition number (hash value) based on the value of the partitioning key for each row.
Key range partitioning. Rows are placed and located in the partitions according to the value of the partitioning key (all rows with the key value from A to K are in partition 1, L to T are in partition 2, etc.).
Schema partitioning. This is an option not to partition a table across disks; instead, an entire table is placed on one disk, another table is placed on a different disk, etc. This is useful for small reference tables that are more effectively used when replicated in each partition rather than spread across partitions.
User-defined partitioning. This is a partitioning method that allows a table to be partitioned on the basis of a user-defined expression (e.g., use state codes to place rows in one of 50 partitions).
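
Three of these methods reduce to a function from a row (or its key) to a partition number, as the sketch below shows. The choice of CRC-32 as the hash, the range boundaries, and the shortened state list are assumptions made for illustration; a real DBMS uses its own hash function and catalog definitions. Schema partitioning is omitted because it assigns whole tables, not rows.

```python
import zlib


def hash_partition(key, n_partitions):
    """Hash partitioning: partition number (1..n) from a hash of the key.
    CRC-32 is used because it gives the same value on every run."""
    return zlib.crc32(str(key).encode("utf-8")) % n_partitions + 1


def key_range_partition(key, upper_bounds=("K", "T")):
    """Key range partitioning: A-K -> 1, L-T -> 2, everything else -> 3."""
    first = key[0].upper()
    for number, bound in enumerate(upper_bounds, start=1):
        if first <= bound:
            return number
    return len(upper_bounds) + 1


# Shortened, hypothetical list; a full version would hold all 50 state codes.
STATE_CODES = ["AL", "AK", "AZ", "AR", "CA"]


def user_defined_partition(row, expression):
    """User-defined partitioning: any caller-supplied expression on the row."""
    return expression(row)


p_hash = hash_partition("customer-42", 4)
p_range = [key_range_partition(k) for k in ("Adams", "Lopez", "Young")]
p_user = user_defined_partition(
    {"name": "Ann", "state": "AZ"},
    lambda row: STATE_CODES.index(row["state"]) + 1,
)
```

Hash partitioning tends to spread rows evenly, while key range partitioning keeps neighboring key values together, which helps range queries.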


Database Architectures for Parallel Processing


Shared-memory architecture
Shared-disk architecture
Shared-nothing architecture


Shared-memory architecture


Shared-disk architecture


Shared-nothing architecture


Parallel DBMS Features


Scope and techniques of parallel DBMS operations
Optimizer implementation
Application transparency
The parallel environment
DBMS management tools
Price/performance


DBMS Schemas for Decision Support


Data warehousing projects were forced to choose between a data model and a corresponding database schema that is intuitive for analysis but performs poorly and a model-schema that performs better but is not well suited for analysis. The schema methodology that is gaining widespread acceptance for data warehousing is the star schema.


Indeed, solving modern business problems such as market analysis and financial forecasting requires query-centric database schemas that are array oriented and multidimensional in nature. These business problems are characterized by the need to retrieve large numbers of records from very large data sets (hundreds of gigabytes and even terabytes) and summarize them on the fly.


DBMS Schemas for Decision Support


Star schema
Potential performance problems with star schemas


Star Schema
The multidimensional view of data that is expressed using relational database semantics is provided by the database schema design called star schema. The basic premise of star schemas is that information can be classified into two groups: facts and dimensions. Facts are the core data element being analyzed.

For example, units of individual items sold are facts, while dimensions are attributes about the facts. For example, dimensions are the product types purchased and the date of purchase (see Fig 9.1).

In this example, the facts (UNITS) are viewed through a set of dimensions (MARKETS, PRODUCTS, PERIOD). It is important to notice that, in the typical star schema, the fact table is much larger than any of its dimension tables. This point becomes an important consideration in the performance issues associated with star schemas.
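
A star schema with the UNITS fact table and the MARKETS, PRODUCTS, and PERIOD dimensions can be sketched with Python's built-in `sqlite3` module. The table names come from the text above; the column names and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: small, descriptive attributes.
CREATE TABLE markets  (market_id  INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE products (product_id INTEGER PRIMARY KEY, product_type TEXT);
CREATE TABLE period   (period_id  INTEGER PRIMARY KEY, sale_date TEXT);

-- Fact table: one row per market/product/period, keyed by the dimension keys.
CREATE TABLE units (
    market_id  INTEGER REFERENCES markets(market_id),
    product_id INTEGER REFERENCES products(product_id),
    period_id  INTEGER REFERENCES period(period_id),
    units_sold INTEGER,
    PRIMARY KEY (market_id, product_id, period_id)
);

INSERT INTO markets  VALUES (1, 'EAST'), (2, 'WEST');
INSERT INTO products VALUES (1, 'widget'), (2, 'gadget');
INSERT INTO period   VALUES (1, '2024-01-01'), (2, '2024-01-02');
INSERT INTO units    VALUES (1, 1, 1, 10), (1, 2, 1, 7), (2, 1, 1, 4), (1, 1, 2, 12);
""")

# A typical star join: constrain and group by dimension attributes,
# then summarize the facts.
rows = conn.execute("""
    SELECT p.product_type, m.region, SUM(u.units_sold)
    FROM units u
    JOIN products p ON p.product_id = u.product_id
    JOIN markets  m ON m.market_id  = u.market_id
    GROUP BY p.product_type, m.region
    ORDER BY p.product_type, m.region
""").fetchall()
conn.close()
```

The compound primary key on the fact table is made up of the dimension keys, which is the design whose indexing cost is discussed under the performance problems of star schemas.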


Potential performance problems with star schemas


Indexing. Using indexes can enforce the uniqueness of the keys, but indexing a compound key has drawbacks:
It requires multiple metadata definitions (one for each key component) to define a single relationship (table); this adds to the design complexity and causes sluggish performance.
Since the fact table must carry all key components as part of its primary key, addition or deletion of levels in the hierarchy will require physical modification of the affected table, which is a time-consuming process that limits flexibility.
Carrying all the segments of the compound dimensional key in the fact table increases the size of the index, thus impacting both performance and scalability.

Metadata
Metadata is one of the most important aspects of data warehousing. It is data about data stored in the warehouse and its users. At a minimum, metadata contains:
The location and description of warehouse system and data components (warehouse objects)
Names, definition, structure, and content of the data warehouse and end-user views
Identification of authoritative data sources (systems of record)
Integration and transformation rules used to populate the data warehouse; these include the mapping method from operational databases into the warehouse, and algorithms used to convert, enhance, or transform data
Integration and transformation rules used to deliver data to end-user analytical tools
Subscription information for the information delivery to the analysis subscribers
Data warehouse operational information, which includes a history of warehouse updates, refreshments, snapshots, versions, ownership authorizations, and extract audit trail
Metrics used to analyze warehouse usage and performance according to end-user usage patterns
Security authorizations, access control lists, etc.


Metadata Repository
Metadata repository management software can be used to map the source data to the target database, generate code for data transformations, integrate and transform the data, and control moving data to the warehouse. This software, which typically runs on a workstation, enables users to specify how the data should be transformed, such as data mapping, conversion, and summarization. Metadata is searched by users to find data definitions or subject areas. In other words, metadata provides decision-support-oriented pointers to warehouse data, and thus provides a logical link between warehouse data and the decision support application.


Having such a metadata repository implemented as a part of the data warehouse framework provides the following benefits:
It provides a comprehensive suite of tools for enterprise-wide metadata management.
It reduces or eliminates information redundancy, inconsistency, and underutilization.
It simplifies management and improves organization, control, and accounting of information assets.
It increases identification, understanding, coordination, and utilization of enterprise-wide information assets.
It provides effective data administration tools to better manage corporate information assets with a full-function data dictionary.
It increases flexibility, control, and reliability of the application development process and accelerates internal application development.
It leverages investment in legacy systems with the ability to inventory and utilize existing applications.
It provides a universal relational model for heterogeneous RDBMSs to interact and share information.
It enforces CASE development standards and eliminates redundancy with the ability to share and reuse metadata.


Metadata Management
A frequently occurring problem in data warehousing is the inability to communicate to the end user what information resides in the data warehouse and how it can be accessed. The key to providing users and applications with a roadmap to the information stored in the warehouse is the metadata. It can define all data elements and their attributes, data sources and timing, and the rules that govern data use and data transformations. Metadata describes the information in the warehouse from multiple viewpoints (input, sources, transformation, access, etc.).


Metadata helps answer questions such as:
What data exists in the data warehouse
Where to find the data
What the original sources of the data are
How summarizations were created
What transformations were used
Who is responsible for correcting errors
What queries can be used to access the data
How business definitions have changed over time
What underlying business assumptions have been made
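
Several of these questions map directly onto fields of a metadata record. The sketch below is a hypothetical, minimal repository: the `MetadataEntry` fields, the entry contents, and the keyword search are assumptions for illustration, not the design of any real metadata tool.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataEntry:
    name: str               # what data exists
    location: str           # where to find the data
    sources: list           # what the original sources are
    transformation: str     # how it was summarized or transformed
    steward: str            # who is responsible for correcting errors
    keywords: set = field(default_factory=set)   # business-oriented key words


def search(repository, keyword):
    """Find entries by business keyword or by part of the entry name."""
    kw = keyword.lower()
    return [e for e in repository if kw in e.keywords or kw in e.name.lower()]


repository = [
    MetadataEntry(
        "monthly_sales", "warehouse.sales_monthly", ["orders_db.orders"],
        "sum of order amount per month", "sales-data-team", {"sales", "revenue"},
    ),
    MetadataEntry(
        "customer_dim", "warehouse.customer", ["crm.customers"],
        "deduplicated, names standardized", "crm-team", {"customer"},
    ),
]
hits = search(repository, "Revenue")
```

A business user searching for "revenue" is pointed to the table, its source system, and the team that owns it, without needing to know the physical names in advance.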


Thank You