Data Warehousing. Data Warehousing Components. Building a Data Warehouse. Mapping the Data Warehouse to a Multiprocessor Architecture. DBMS Schemas for Decision Support. Data Extraction, Cleanup, and Transformation Tools. Metadata. Data Mining: Introduction to Data Mining
Kapil Tomar, IT Deptt. AKGEC 1
Typically, the source data for the warehouse comes from the operational applications (an exception might be an operational data store (ODS)). As the data enters the data warehouse, it is transformed into an integrated structure and format. The transformation process may involve conversion, summarization, filtering, and condensation of data. Because data within the data warehouse contains a large historical component (sometimes covering 5 to 10 years), the data warehouse must be capable of holding and managing large volumes of data as well as different data structures for the same database over time.
The data sourcing, cleanup, extract, transformation, and migration tools have to deal with some significant issues as follows:
Database heterogeneity. DBMSs differ widely in data models, data access languages, data navigation, operations, concurrency, integrity, recovery, etc.
Data heterogeneity. This is the difference in the way data is defined and used in different models: homonyms, synonyms, unit incompatibility, different attributes for the same entity, and different ways of modeling the same fact.
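Resolving such heterogeneity is the job of the cleanup and transformation tools. A minimal sketch of the idea in Python follows; the field names, synonym table, and unit conventions are invented for the illustration, not taken from any specific tool.

```python
# Map heterogeneous source fields onto one integrated warehouse structure.
# FIELD_SYNONYMS resolves synonyms (different names, same attribute);
# to_kg resolves unit incompatibility by normalizing weights to kilograms.

FIELD_SYNONYMS = {"cust_nm": "customer_name", "custname": "customer_name"}

def to_kg(value, unit):
    """Normalize a weight value to kilograms."""
    factors = {"kg": 1.0, "lb": 0.45359237, "g": 0.001}
    return value * factors[unit]

def transform(record):
    """Rename synonymous fields, then convert units to the warehouse standard."""
    clean = {}
    for field, value in record.items():
        clean[FIELD_SYNONYMS.get(field, field)] = value
    if "weight" in clean and "weight_unit" in clean:
        clean["weight_kg"] = round(to_kg(clean.pop("weight"),
                                         clean.pop("weight_unit")), 3)
    return clean

row = {"cust_nm": "Acme", "weight": 10, "weight_unit": "lb"}
print(transform(row))  # {'customer_name': 'Acme', 'weight_kg': 4.536}
```

A real tool would drive the same renaming and conversion from metadata rather than hard-coded tables.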
Metadata
Metadata is data about data that describes the data warehouse. It is used for building, maintaining, managing, and using the data warehouse. Metadata can be classified into technical metadata and business metadata.
Technical metadata contains information about warehouse data for use by warehouse designers and administrators when carrying out warehouse development and management tasks. Technical metadata documents include:
- Information about data sources
- Transformation descriptions, i.e., the mapping from operational databases into the warehouse, and the algorithms used to convert, enhance, or transform data
- Warehouse object and data structure definitions for data targets
- The rules used to perform data cleanup and data enhancement
- Data mapping operations when capturing data from source systems and applying it to the target warehouse database
- Access authorization, backup history, archive history, information delivery history, data acquisition history, data access, etc.
Business metadata contains information that gives users an easy-to-understand perspective of the information stored in the data warehouse. Business metadata documents include information about:
- Subject areas and information object types, including queries, reports, images, video, and/or audio clips
- Internet home pages
- Other information to support all data warehousing components; for example, the information related to the information delivery system (see Sec. 6.8) should include subscription information, scheduling information, details of delivery destinations, and the business query objects such as predefined queries, reports, and analyses
- Data warehouse operational information, e.g., data history (snapshots, versions), ownership, extract audit trail, and usage data
Metadata repository management software can be used to map the source data to the target database, generate code for data transformations, integrate and transform the data, and control moving data to the warehouse. One of the important functional components of the metadata repository is the information directory.
From a technical requirements point of view, the information directory and the entire metadata repository:
- Should be a gateway to the data warehouse environment, and thus should be accessible from any platform via transparent and seamless connections
- Should support easy distribution and replication of its content for high performance and availability
- Should be searchable by business-oriented key words
- Should act as a launch platform for end-user data access and analysis tools
- Should support the sharing of information objects such as queries, reports, data collections, and subscriptions between users
- Should support a variety of scheduling options for requests against the data warehouse, including on-demand, one-time, repetitive, event-driven, and conditional delivery (in conjunction with the information delivery system)
- Should support the distribution of query results to one or more destinations in any of the user-specified formats (in conjunction with the information delivery system)
- Should support and provide interfaces to other applications such as e-mail, spreadsheets, and schedulers
Access Tools
The principal purpose of data warehousing is to provide information to business users for strategic decision making. These users interact with the data warehouse using front-end tools.
For the purpose of this discussion let's divide these tools into five main groups:
- Data query and reporting tools
- Application development tools
- Executive information system (EIS) tools
- On-line analytical processing (OLAP) tools
- Data mining tools
1. Production reporting tools let companies generate regular operational reports or support high-volume batch jobs, such as calculating and printing paychecks. Report writers, on the other hand, are inexpensive desktop tools designed for end users.
2. Managed query tools shield end users from the complexities of SQL and database structures by inserting a metalayer between users and the database. The metalayer is the software that provides subject-oriented views of a database and supports point-and-click creation of SQL.
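The metalayer idea can be sketched in a few lines of Python. This is a toy illustration, not the design of any real managed query tool; all subject, table, and column names below are invented assumptions.

```python
# A toy metalayer: business-oriented names map to physical tables/columns,
# and SQL is generated from the user's point-and-click selections, so the
# user never writes SQL directly.

METALAYER = {
    "Customer Revenue": {
        "table": "fact_sales s JOIN dim_customer c ON s.cust_id = c.cust_id",
        "columns": {"Customer": "c.name", "Revenue": "SUM(s.amount)"},
        "group_by": ["c.name"],
    }
}

def build_query(subject, selections):
    """Translate a subject-area view plus selected items into SQL."""
    view = METALAYER[subject]
    cols = ", ".join(view["columns"][s] for s in selections)
    sql = f"SELECT {cols} FROM {view['table']}"
    if view["group_by"]:
        sql += " GROUP BY " + ", ".join(view["group_by"])
    return sql

print(build_query("Customer Revenue", ["Customer", "Revenue"]))
```

The generated statement is `SELECT c.name, SUM(s.amount) FROM fact_sales s JOIN dim_customer c ON s.cust_id = c.cust_id GROUP BY c.name`: the subject-oriented view hides both the join and the aggregation from the end user.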
Data Mining
A critical success factor for any business today is its ability to use information effectively. Knowing this information, an organization can formulate effective business, marketing, and sales strategies; precisely target promotional activity; discover and penetrate new markets; and successfully compete in the marketplace from a position of informed strength.
A relatively new and promising technology aimed at achieving this strategic advantage is known as data mining.
A major attraction of data mining is its ability to build predictive rather than retrospective models.
Most organizations engage in data mining to:
- Visualize data
- Correct data
- Discover knowledge
The goal of knowledge discovery is to determine explicit hidden relationships, patterns, or correlations from data stored in an enterprise's database. Specifically, data mining can be used to perform:
- Segmentation (e.g., grouping customer records for custom-tailored marketing)
- Classification (assignment of input data to a predefined class; discovery and understanding of trends; text document classification)
- Association (discovery of cross-sales opportunities)
- Preferencing (determining the preferences of a customer majority)
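As a minimal illustration of the segmentation task above, the sketch below groups customer records into segments that could each receive custom-tailored marketing. The thresholds and field names are invented for the example; real segmentation would typically use a clustering algorithm over many attributes, not fixed rules.

```python
# Rule-based segmentation of customer records by annual spend.
# Segment names and thresholds are arbitrary illustrative choices.

def segment(customer):
    if customer["annual_spend"] >= 10000:
        return "premium"
    if customer["annual_spend"] >= 1000:
        return "regular"
    return "occasional"

customers = [
    {"id": 1, "annual_spend": 15000},
    {"id": 2, "annual_spend": 2500},
    {"id": 3, "annual_spend": 300},
]
segments = {c["id"]: segment(c) for c in customers}
print(segments)  # {1: 'premium', 2: 'regular', 3: 'occasional'}
```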
Data Marts
The term data mart means different things to different people. A rigorous definition of this term is a data store that is subsidiary to a data warehouse of integrated data. The data mart is directed at a partition of data (often called a subject area) that is created for the use of a dedicated group of users. A data mart might, in fact, be a set of denormalized, summarized, or aggregated data. Sometimes, such a set could be placed on the data warehouse database rather than in a physically separate store of data. In most instances, however, the data mart is a physically separate store of data and is normally resident on a separate database server, often on the local area network serving a dedicated user group.
A data mart is often a necessary and valid solution to a pressing business problem, achieving the goal of rapid delivery of enhanced decision support functionality to end users. The business drivers underlying such developments include:
- Extremely urgent user requirements
- The absence of a budget for a full data warehouse strategy
- The absence of a sponsor for an enterprise-wide decision support strategy
- The decentralization of business units
- The attraction of easy-to-use tools and a mind-sized project
In summary, data marts present two problems: (1) scalability in situations where an initial small data mart grows quickly in multiple dimensions and (2) data integration. Therefore, when designing data marts, the organizations should pay close attention to system scalability, data consistency, and manageability issues. The key to a successful data mart strategy is the development of an overall scalable data warehouse architecture; and the key step in that architecture is identifying and implementing the common dimensions.
Tangible benefits
- Product inventory turnover is improved.
- Costs of product introduction are decreased with improved selection of target markets.
- More cost-effective decision making is enabled by separating (ad hoc) query processing from running against operational databases.
- Better business intelligence is enabled by the increased quality and flexibility of market analysis available through multilevel data structures, which may range from detailed to highly summarized. For example, determining the effectiveness of marketing programs allows the elimination of weaker programs and the enhancement of stronger ones.
- Enhanced asset and liability management means that a data warehouse can provide a "big picture" of enterprise-wide purchasing and inventory patterns, and can indicate otherwise unseen credit exposure and opportunities for cost savings.
Intangible benefits
- Improved productivity, by keeping all required data in a single location and eliminating the rekeying of data
- Reduced redundant processing, support, and software to support overlapping decision support applications
- Enhanced customer relations through improved knowledge of individual requirements and trends, through customization, improved communications, and tailored product offerings
- Enabling business process reengineering: data warehousing can provide useful insights into the work processes themselves, resulting in breakthrough ideas for the reengineering of those processes
Speed-up: the ability to execute the same request in correspondingly less time as more processors are added. Scale-up: the ability to obtain the same performance on the same request as the database size increases.
An additional and important goal is to achieve linear speed-up and scale-up: doubling the number of processors cuts the response time in half (linear speed-up), or provides the same performance on twice as much data (linear scale-up). These goals of linear performance and scalability can be satisfied by parallel hardware architectures, parallel operating systems, and parallel database management systems. Parallel hardware architectures are based on multiprocessor systems designed as a shared-memory model [symmetric multiprocessors (SMPs)], a shared-disk model, or a distributed-memory model [massively parallel processors (MPPs) and clusters of uniprocessors and/or SMPs].
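The linear speed-up and scale-up goals can be checked from measured run times. The sketch below uses the standard ratio definitions; the sample timings are illustrative, not benchmark results.

```python
# Speed-up and scale-up ratios from elapsed times.

def speedup(t_before, t_after):
    """Speed-up: same task and data, more processors; 2.0 means the
    response time was cut in half."""
    return t_before / t_after

def scaleup(t_small, t_large):
    """Scale-up: data and hardware grow together; 1.0 means performance
    held constant (linear scale-up)."""
    return t_small / t_large

# Doubling processors cut the response time from 120 s to 60 s:
print(speedup(120.0, 60.0))   # 2.0 -> linear speed-up
# Twice the data on twice the hardware still ran in 120 s:
print(scaleup(120.0, 120.0))  # 1.0 -> linear scale-up
```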
Types of parallelism
- Horizontal parallelism
- Vertical parallelism
- Horizontal parallelism means that the database is partitioned across multiple disks, and parallel processing occurs within a specific task (e.g., a table scan) that is performed concurrently on different processors against different sets of data.
- Vertical parallelism occurs among different tasks: all component query operations (e.g., scan, join, sort) are executed in parallel in a pipelined fashion. In other words, an output from one task (e.g., scan) becomes an input into another task (e.g., join) as soon as records become available.
A truly parallel DBMS should support both horizontal and vertical types of parallelism concurrently (see Fig. 8.1, case 4).
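The pipelined flavor of vertical parallelism can be sketched with Python generators: each stage consumes rows from the previous stage as soon as they are produced, rather than waiting for the whole table. (In a real parallel DBMS the stages would run concurrently on different processors; generators only model the row-at-a-time data flow.)

```python
# Two pipelined query stages: a table scan feeding a filter.
# Each yielded row flows to the next stage immediately.

def scan(table):
    for row in table:              # stage 1: table scan
        yield row

def filter_stage(rows, min_qty):
    for row in rows:               # stage 2: begins before the scan ends
        if row["qty"] >= min_qty:
            yield row

table = [{"item": "a", "qty": 5}, {"item": "b", "qty": 1},
         {"item": "c", "qty": 9}]
pipeline = filter_stage(scan(table), min_qty=4)
print([r["item"] for r in pipeline])  # ['a', 'c']
```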
Data partitioning
- Hash partitioning
- Key range partitioning
- Schema partitioning
- User-defined partitioning
Data partitioning
- Hash partitioning. A hash algorithm is used to calculate the partition number (hash value) based on the value of the partitioning key for each row.
- Key range partitioning. Rows are placed and located in partitions according to the value of the partitioning key (e.g., all rows with key values from A to K are in partition 1, L to T in partition 2, etc.).
- Schema partitioning. This is an option not to partition a table across disks; instead, an entire table is placed on one disk, another table on a different disk, etc. This is useful for small reference tables that are more effectively used when replicated in each partition rather than spread across partitions.
- User-defined partitioning. This is a partitioning method that allows a table to be partitioned on the basis of a user-defined expression (e.g., use state codes to place rows in one of 50 partitions).
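The first two methods can be sketched directly in Python. The partition count and key ranges below are arbitrary choices for the illustration.

```python
# Hash and key-range partitioning of rows by a string key.

def hash_partition(key, n_partitions):
    """Hash partitioning: partition number from a hash of the key."""
    return hash(key) % n_partitions

def range_partition(key):
    """Key-range partitioning: A-K -> partition 0, L-T -> 1, else 2."""
    first = key[0].upper()
    if "A" <= first <= "K":
        return 0
    if "L" <= first <= "T":
        return 1
    return 2

names = ["Adams", "Lopez", "Young", "Kim"]
print({name: range_partition(name) for name in names})
# {'Adams': 0, 'Lopez': 1, 'Young': 2, 'Kim': 0}
```

Note that Python's `hash()` for strings varies between interpreter runs, so a real system would use a stable hash function to keep partition assignments reproducible.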
Shared-memory architecture
Shared-disk architecture
Shared-nothing architecture
Indeed, solving modern business problems such as market analysis and financial forecasting requires query-centric database schemas that are array-oriented and multidimensional in nature. These business problems are characterized by the need to retrieve large numbers of records from very large data sets (hundreds of gigabytes and even terabytes) and summarize them on the fly.
Star Schema
The multidimensional view of data that is expressed using relational database semantics is provided by the database schema design called the star schema. The basic premise of star schemas is that information can be classified into two groups: facts and dimensions. Facts are the core data elements being analyzed.
Dimensions are attributes about the facts: for example, the product types purchased and the date of purchase (see Fig. 9.1).
Figure 9.1 shows a star schema that connects facts (UNITS) through a set of dimensions (MARKETS, PRODUCTS, PERIOD). It is important to notice that, in the typical star schema, the fact table is much larger than any of its dimension tables. This point becomes an important consideration in the performance issues associated with star schemas.
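A minimal star schema of this shape can be built and queried with Python's built-in sqlite3 module. The column names and rows below are made up to mirror the UNITS/MARKETS/PRODUCTS/PERIOD example; they are not from the slide's figure.

```python
# One fact table (units) joined to three small dimension tables:
# the classic star join, with on-the-fly summarization via SUM/GROUP BY.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE markets  (market_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE products (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE period   (period_id INTEGER PRIMARY KEY, month TEXT);
    CREATE TABLE units    (market_id INTEGER, product_id INTEGER,
                           period_id INTEGER, units_sold INTEGER);
""")
con.execute("INSERT INTO markets VALUES (1, 'North')")
con.execute("INSERT INTO products VALUES (1, 'Widget')")
con.execute("INSERT INTO period VALUES (1, 'Jan')")
con.executemany("INSERT INTO units VALUES (?, ?, ?, ?)",
                [(1, 1, 1, 40), (1, 1, 1, 60)])

row = con.execute("""
    SELECT m.region, p.name, t.month, SUM(u.units_sold)
    FROM units u
    JOIN markets m  ON u.market_id = m.market_id
    JOIN products p ON u.product_id = p.product_id
    JOIN period t   ON u.period_id = t.period_id
    GROUP BY m.region, p.name, t.month
""").fetchone()
print(row)  # ('North', 'Widget', 'Jan', 100)
```

The fact table holds many narrow rows of measures plus foreign keys, while each dimension table stays small; this size asymmetry is exactly the performance consideration noted above.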
Metadata
Metadata is one of the most important aspects of data warehousing. It is data about data stored in the warehouse and its users. At a minimum, metadata contains:
- The location and description of warehouse system and data components (warehouse objects)
- Names, definitions, structure, and content of the data warehouse and end-user views
- Identification of authoritative data sources (systems of record)
- Integration and transformation rules used to populate the data warehouse; these include the mapping method from operational databases into the warehouse, and the algorithms used to convert, enhance, or transform data
- Integration and transformation rules used to deliver data to end-user analytical tools
- Subscription information for information delivery to the analysis subscribers
- Data warehouse operational information, which includes a history of warehouse updates, refreshments, snapshots, versions, ownership authorizations, and extract audit trail
- Metrics used to analyze warehouse usage and performance according to end-user usage patterns
- Security authorizations, access control lists, etc.
Metadata Repository
Metadata repository management software can be used to map the source data to the target database, generate code for data transformations, integrate and transform the data, and control moving data to the warehouse. This software, which typically runs on a workstation, enables users to specify how the data should be transformed, such as data mapping, conversion, and summarization. Users search the metadata to find data definitions or subject areas. In other words, metadata provides decision-support-oriented pointers to warehouse data, and thus provides a logical link between warehouse data and the decision support application.
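A toy version of such a searchable repository (the information directory) can be sketched as follows; the entries, objects, and keywords are invented for the example.

```python
# A tiny information directory: metadata entries that act as
# decision-support-oriented pointers to warehouse data,
# searchable by business-oriented key words.

REPOSITORY = [
    {"object": "dim_customer", "keywords": ["customer", "client"],
     "source": "CRM extract", "definition": "One row per customer."},
    {"object": "fact_sales", "keywords": ["sales", "revenue"],
     "source": "Order system", "definition": "One row per line item."},
]

def search(keyword):
    """Return the warehouse objects tagged with a business keyword."""
    kw = keyword.lower()
    return [e["object"] for e in REPOSITORY if kw in e["keywords"]]

print(search("revenue"))  # ['fact_sales']
```

A production repository would of course store this catalog in a database and track sources, transformation rules, and access history alongside each object, as the slides describe.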
Having such a metadata repository implemented as part of the data warehouse framework provides the following benefits:
- It provides a comprehensive suite of tools for enterprise-wide metadata management.
- It provides effective data administration tools to better manage corporate information assets with a full-function data dictionary.
- It increases the flexibility, control, and reliability of the application development process, and accelerates internal application development.
- It leverages investment in legacy systems with the ability to inventory and utilize existing applications.
- It provides a universal relational model for heterogeneous RDBMSs to interact and share information.
- It enforces CASE development standards and eliminates redundancy with the ability to share and reuse metadata.
Metadata Management
A frequently occurring problem in data warehousing is the inability to communicate to the end user what information resides in the data warehouse and how it can be accessed. The key to providing users and applications with a roadmap to the information stored in the warehouse is the metadata. It can define all data elements and their attributes, data sources and timing, and the rules that govern data use and data transformations. Since metadata describes the information in the warehouse from multiple viewpoints (input, sources, transformation, access, etc.), it can answer questions such as:
- What data exists in the data warehouse
- Where to find the data
- What the original sources of the data are
- How summarizations were created
- What transformations were used
- Who is responsible for correcting errors
- What queries can be used to access the data
- How business definitions have changed over time
- What underlying business assumptions have been made
Thank You