
Data Warehousing

Abdur Rahman Bin Shahid Chittagong University of Engineering & Technology Chittagong, Bangladesh

1. Introduction
In today's business environment, with competition between business organizations growing rapidly, management requires more, and more appropriate, information in less and less time. Given the advancement of information technology, one might expect most organizations to have a well-developed information system that fulfills this requirement. In reality, despite having large volumes of data, and often many databases, few organizations have more than a fraction of the information they need. Managers are often frustrated by their inability to access or use the data and information they need. There are two important reasons why this information gap has arisen in most organizations. The first is the fragmented way in which organizations have developed their information systems, and the databases that support them, over many years. Constraints on time and resources cause most organizations to resort to a one-thing-at-a-time approach, which produces islands of information systems and, with them, a collection of uncoordinated and often inconsistent databases. In most cases these databases run on a variety of hardware and software platforms. In this environment it is extremely difficult for managers to locate and use accurate information, which must be synthesized across these various systems of record. The second reason is that most organizations have developed their systems around operational processing, with little or no thought given to the information or analytical tools needed for decision making. The processing performed and the types of data required are very different for operational and analytical processing. A data warehouse bridges this information gap: it consolidates and integrates information from various sources and gives it a meaningful format for the appropriate users (Martin, 1997).
It supports complex business decisions through applications such as trend analysis, target marketing, competitive analysis, and customer relationship management. It provides a way to meet these needs without disturbing existing operational processing.

2. Data Warehouse Philosophy


A data warehouse is a subject-oriented, integrated, time-variant, nonupdatable collection of data used in support of management decision-making processes and business intelligence (Inmon and Hackathorn, 1994). The elements of a data warehouse are:

• Subject Orientation: Data in a data warehouse is organized around the business issues of the enterprise [1]. Operational data is organized by its physical materialization, including file names, job schedules, and application dependencies. A data warehouse presents data that reflects the major subject areas within the enterprise, not the physical manipulation of operational data [2]. The subject orientation of a data warehouse enables it to absorb inevitable changes without severe changes to its architecture [3]. For example, using a data warehouse built around sales, one can answer questions like "Who was our best customer for this item last year?" This ability to define a data warehouse by subject matter (sales, in this case) makes the data warehouse subject oriented.

• Integration: The integration of data in a data warehouse enables its customers to query data across subject areas without traversing other data sources [4]. Data integration occurs in multiple ways, which can be grouped into three categories:

o Form: Data form defines the types and layouts of data [5]. Disparate business units may express similar data elements in different ways. For example:
  - Money can be expressed as a currency or an integer data type.
  - Names can be expressed as First, Last or Last, First.
By integrating the different expressions of similar elements into a single form, a data warehouse allows its users to query across business subjects.

o Function: Function includes the substance and meaning of the data within the data element. Codes and cryptic values often differ between business units and must be reconciled so that the entire organization can control these codes and values.

o Grain: Grain refers to the unit of measurement at which data is expressed [6]. Business units may store data using different units of measurement:
  - Purchasing measures product by the barrel.
  - Transportation measures product by the shipload.
In this scenario, a data warehouse will reconcile these different units of measurement, which will allow the integration of data from the purchasing, transportation, and sales business units. The implication of grain for a data warehouse is that it cannot provide data to customers at a grain lower than the grain at which the data is stored [7].

A data warehouse must integrate the Form, Function, and Grain of data from disparate business units. Once the data is integrated, data warehouse customers can traverse data within business subjects from across the enterprise.

• Nonvolatility: Data, once loaded into a data warehouse, cannot be deleted or updated by the end user [8]. Data can be refreshed to reflect the historical and current state of the enterprise by inserting new rows [9]. By retaining that data, nonvolatility allows a data warehouse to express the enterprise across time [10].

• Time Variant: Data in the data warehouse contains a time element so that it may be used to study trends and changes [11]. Time-variant data allows a data warehouse to express the enterprise as of a moment in time [12]. Historical data allows a data warehouse to express the enterprise from various historical contexts.
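The reconciliation of Form, Function, and Grain described under Integration can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the status code table, and the number of barrels per shipload are hypothetical values chosen for the example, not figures from any real system.

```python
from decimal import Decimal

# Hypothetical conversion factors to a common grain (barrels); illustrative only.
BARRELS_PER_UNIT = {"barrel": Decimal(1), "shipload": Decimal(500000)}

# Hypothetical mapping of business-unit codes to one enterprise-wide code (Function).
STATUS_CODES = {"A": "ACTIVE", "1": "ACTIVE", "I": "INACTIVE", "0": "INACTIVE"}


def normalize_name(raw: str) -> str:
    """Form: convert 'Last, First' to 'First Last'; collapse extra spaces otherwise."""
    if "," in raw:
        last, first = (part.strip() for part in raw.split(",", 1))
        return f"{first} {last}"
    return " ".join(raw.split())


def normalize_money(raw) -> Decimal:
    """Form: express money as a two-decimal Decimal whatever the source type."""
    return Decimal(str(raw)).quantize(Decimal("0.01"))


def to_barrels(quantity, unit: str) -> Decimal:
    """Grain: convert a quantity in any known unit to barrels."""
    return Decimal(str(quantity)) * BARRELS_PER_UNIT[unit]


def integrate(record: dict) -> dict:
    """Apply the Form, Function, and Grain rules to one source record."""
    return {
        "customer": normalize_name(record["customer"]),
        "amount": normalize_money(record["amount"]),
        "status": STATUS_CODES[record["status"]],
        "barrels": to_barrels(record["quantity"], record["unit"]),
    }
```

Converting every quantity to the finest unit (barrels here) reflects the point about grain: the warehouse can always aggregate upward from barrels to shiploads, but it could not recover barrels had it stored only shiploads.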

3. Data Warehouse Architecture


The architecture of a data warehouse is shown at two levels: the conceptual level and the physical level [13]. Figure 1 depicts the conceptual architecture, which includes data sources, management components, a storage area, end-user presentation tools, and users. The data sources supply data to the warehouse. The management components query the data of interest, which is then computed, stored, and updated in the warehouse. Data stored in the warehouse is then accessed by the OLAP server, processed in a multidimensional way, and supplied to the users for analysis. Metadata, which is kept in a repository separate from the warehouse data, creates links among users, data sources, and warehouse data. The metadata repository stores information for the transformation, integration, storage, and usage of warehouse data. Together, the components integrate and update warehouse data, serve users with data analysis functions, and perform the tasks of decision support.

Figure 1: Conceptual warehouse architecture

The physical-level architecture is shown in Figure 2 and is divided into three tiers: the first tier, the second tier, and the third tier.

Figure 2: Architecture of Data Warehouse



3.1 Components of DW
• Operational data sources: Data for the DW is supplied from mainframe operational data held in first-generation hierarchical and network databases, departmental data held in proprietary file systems, private data held on workstations and private servers, and external systems such as the Internet, commercially available databases, or databases associated with an organization's suppliers or customers.
• Operational data store (ODS): A repository of current and integrated operational data used for analysis. It is often structured and supplied with data in the same way as the data warehouse, but may in fact simply act as a staging area for data to be moved into the warehouse.
• Load manager: Also called the frontend component, it performs all the operations associated with the extraction and loading of data into the warehouse. These operations include simple transformations that prepare the data for entry into the warehouse.
• Warehouse manager: Performs all the operations associated with the management of the data in the warehouse. These operations include analysis of data to ensure consistency, transformation and merging of source data, creation of indexes and views, generation of denormalizations and aggregations, and archiving and backing up data.
• Query manager: Also called the backend component, it performs all the operations associated with the management of user queries. These operations include directing queries to the appropriate tables and scheduling the execution of queries.
• Meta-data: Meta-data is used for a variety of purposes, and its management is a critical issue in achieving a fully integrated data warehouse. The major purpose of meta-data is to show the pathway back to where the data began, so that the warehouse administrators know the history of any item in the warehouse. The meta-data associated with data transformation and loading must describe the source data and any changes that were made to the data. The meta-data associated with data management describes the data as it is stored in the warehouse. The query manager requires meta-data to generate appropriate queries, and meta-data is also associated with the users of queries.
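To make the division of labour among the load manager, warehouse manager, and query manager more concrete, the following minimal sketch uses Python's built-in sqlite3 module with an in-memory database. The table names (sales, sales_by_region), the column layout, and the function names are assumptions made for this illustration; a production warehouse would involve far more than is shown here.

```python
import sqlite3


def load_manager(conn, source_rows):
    """Extract and load: apply a simple transformation, then insert the rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (region TEXT, product TEXT, amount REAL)"
    )
    # Simple transformation: trim whitespace, upper-case regions, cast amounts.
    cleaned = [(r.strip().upper(), p.strip(), float(a)) for r, p, a in source_rows]
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", cleaned)


def warehouse_manager(conn):
    """Manage stored data: build an index and a precomputed aggregation."""
    conn.execute("CREATE INDEX IF NOT EXISTS idx_sales_region ON sales (region)")
    conn.execute("DROP TABLE IF EXISTS sales_by_region")
    conn.execute(
        "CREATE TABLE sales_by_region AS "
        "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
    )


def query_manager(conn, region, detail=False):
    """Direct the query to the detail table or to the aggregate table."""
    if detail:
        sql = "SELECT product, amount FROM sales WHERE region = ? ORDER BY product"
        return conn.execute(sql, (region,)).fetchall()
    row = conn.execute(
        "SELECT total FROM sales_by_region WHERE region = ?", (region,)
    ).fetchone()
    return row[0] if row else None
```

A caller would run `load_manager` and then `warehouse_manager` after each load, so that summary queries routed by `query_manager` read the small aggregate table rather than scanning the detail rows.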

3.2 Data Mart


A data mart is a subset of a data warehouse that supports the requirements of a particular department or business function; its contents are obtained through ETL (extract, transform, load) processes. The characteristics that differentiate data marts from data warehouses include:
• A data mart focuses only on the requirements of users associated with one department or business function.
• Data marts do not normally contain detailed operational data, unlike data warehouses.
• As data marts contain less data than data warehouses, they are more easily understood and navigated.
There are several reasons for creating a data mart:
• To give users access to the data they need to analyze most often.
• To provide data in a form that matches the collective view of the data held by a group of users in a department or business function.
• To improve end-user response time through the reduction in the volume of data to be accessed.
• To provide appropriately structured data as dictated by the requirements of end-user access tools.
• The cost of implementing data marts is normally less than that required to establish a data warehouse.
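As a rough illustration of a data mart as a departmental subset that carries summarized rather than detailed data, the sketch below filters warehouse rows for one department and totals them by month and product. The row layout (department, month, product, amount) and the function name build_data_mart are assumptions made for this example.

```python
from collections import defaultdict


def build_data_mart(warehouse_rows, department):
    """Return {(month, product): total amount} for a single department.

    Detail rows are not copied into the mart; only the summary is kept,
    which is why the mart is smaller and easier to navigate.
    """
    totals = defaultdict(float)
    for row in warehouse_rows:
        if row["department"] == department:
            totals[(row["month"], row["product"])] += row["amount"]
    return dict(totals)
```

Because the mart holds only one department's summarized data, queries against it touch far fewer rows than the same queries against the warehouse, which is the response-time benefit listed above.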

3.3 On-Line Analytical Processing (OLAP)


On-Line Analytical Processing (OLAP) is the use of a set of graphical tools that provide users with multidimensional views of their data and allow them to analyze the data using simple windowing techniques.
• Relational OLAP (ROLAP) tools use variations of SQL and view the database as a traditional relational database. They access the data warehouse or data mart directly.
• Multidimensional OLAP (MOLAP) tools load data into an intermediate structure, usually a three- or higher-dimensional array. It is important to note that with MOLAP the data are not simply viewed as a multidimensional array; rather, a MOLAP data mart is created by extracting data from the data warehouse or data mart and storing it in a specialized, separate data store in which it can be viewed only through a multidimensional structure. Two types of MOLAP operations are slicing a cube and drill-down.
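The two cube operations named above can be sketched on a small in-memory cube. In this illustration the cube is a plain dictionary keyed by one value per dimension; the dimension names (region, product, year, quarter) and the function names are assumptions, and real MOLAP servers use specialized array storage rather than a dictionary.

```python
from collections import defaultdict

# Assumed dimension order of every cube key: (region, product, year, quarter).
DIMENSIONS = ("region", "product", "year", "quarter")


def slice_cube(cube, dimension, value):
    """Slice: fix one dimension to a value and drop it from the keys."""
    i = DIMENSIONS.index(dimension)
    return {k[:i] + k[i + 1:]: v for k, v in cube.items() if k[i] == value}


def aggregate(cube, keep):
    """Total the measure over every dimension not listed in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(int)
    for key, measure in cube.items():
        totals[tuple(key[i] for i in idx)] += measure
    return dict(totals)
```

Drill-down then amounts to moving from a coarse view such as `aggregate(cube, ("year",))` to a finer one such as `aggregate(cube, ("year", "quarter"))`, which exposes the quarterly figures behind each yearly total.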

Figure 3: Multi-dimensional cube

3.4 Data-Mining

Data mining is knowledge discovery using a sophisticated blend of techniques from traditional statistics, artificial intelligence, and computer graphics. Its three main goals are explanatory, confirmatory, and exploratory. Using data-mining techniques, one can answer queries such as why sales of pickup trucks have increased in Colorado, whether two-income families are more likely to buy family medical coverage than single-income families, or what spending patterns are likely to accompany credit card fraud. Data-mining techniques include case-based reasoning, rule discovery, signal processing, and neural nets.

4. Characteristics of Data warehouse data


Status versus Event Data: An event is a database action (create, update, or delete) that results from a transaction, while status data represent the state of the data at a point in time (e.g., a bank account before or after a withdrawal). In practice, most of the data stored in databases are status data, though both types of data can be stored. A data warehouse likely contains a history of snapshots of status data or a summary of transaction or event data. Event data can be stored for a defined period but are then deleted or archived to save storage.

Transient versus Periodic Data: In data warehouses it is often necessary to maintain a record of when events occurred in the past. This is necessary, for example, to compare sales or inventory levels on a particular date or during a particular period with the previous year's sales on the same date or during the same period. Transient data are data in which changes to existing records are written over previous records, thus destroying the previous data content. Periodic data, on the other hand, are data that are never physically altered or deleted once added to the store.
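The contrast between transient and periodic data, and the year-over-year comparison it enables, can be shown with a short sketch. The store layouts (a dictionary for transient data, a list of dated records for periodic data) and the function names are assumptions made for this illustration.

```python
from datetime import date


def apply_transient(store: dict, key, value) -> None:
    """Transient: the new value overwrites the old one, which is lost."""
    store[key] = value


def apply_periodic(store: list, key, value, recorded: date) -> None:
    """Periodic: a new dated record is added; earlier records stay untouched."""
    store.append((key, recorded, value))


def same_day_last_year(store: list, key, day: date):
    """Return (value on `day`, value one year earlier); None where no record exists."""
    by_date = {d: v for k, d, v in store if k == key}
    try:
        previous = day.replace(year=day.year - 1)
    except ValueError:  # 29 February has no counterpart in a non-leap year
        return by_date.get(day), None
    return by_date.get(day), by_date.get(previous)
```

Only the periodic store can answer the comparison, because the transient store no longer holds last year's value once it has been overwritten.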

5. Data Quality
The success of a data warehouse depends on the quality of its data [14]. The policies related to data quality are:
• Data Accuracy: Individuals who enter, update, or delete data are responsible for the quality and accuracy of the data.
• Data Consistency & Integrity: Regardless of where data is stored within an organization, the same data must be consistent throughout the organization. Data must have enterprise-wide integrity.
• Data Capture: Data will be captured in electronic form as close to the source of origin as possible.
• Data Exchange: Once captured, data will be stored and exchanged using electronic means to avoid manual transcription and re-entry.

6. Data Warehouse Maintenance Issues


After a DW is deployed, maintenance becomes a major issue. Post-implementation maintenance issues must be addressed because they are critical to the long-run success of a data warehousing project. Common maintenance issues, such as improving query performance, need to be tackled right from the rollout to ensure that users are satisfied with performance. The important issues in maintaining the data warehouse are:

• Study of existing Data Warehouses / Data Marts.
• Technology / Tools evaluation and recommendation.
• Roadmap for re-architecture / re-engineering.
• Refining the Data Warehouse / Data Mart architecture and Data Model.
• Introducing data from new source systems to existing Data Warehouses / Data Marts.
• Performance tuning and upgrades.

7. Performance regulation method


Once the implementation of the warehouse is complete, progress must continue to be monitored against the agreed-on success criteria. The data warehouse team must ensure that the existing implementations remain on track and continue to address the needs of the business. Performance issues in data warehousing center on the access performance of running queries and the incremental loading of snapshot changes from the source systems. The performance of a warehouse relies on several factors [15]:
• Network Management: As most data warehousing systems are built on heterogeneous platforms, network management is a vital issue.
• Capacity Planning: Capacity planning refers to determining the required future configuration of hardware and software for a network, data center, or web site. Numerous capacity planning tools on the market are used to monitor and analyze the performance of the current hardware and software.
• Query Management: In a data warehousing environment, users' queries need to be written very efficiently and carefully, as some tables of the data warehouse are very large and queries posted against these tables could take days or weeks to complete. To have an efficient query management system most of the predefined and ad hoc queries.
• Software and Hardware Issues: Updates to the data warehouse are inevitable, as are changes to packaged software, hardware servers, and the supporting network infrastructure. Such changes can be applied in either of two ways:
o Installing new software releases, patches, hardware components or upgrades, and network connections (logical and physical) directly in the production environment.
o Installing new software versions, hardware upgrades, and network improvements in a temporary test environment, and migrating or reconnecting to production once certification testing has concluded.

8. Conclusion
Data warehousing has become a leading and most trustworthy technology used by companies today for planning, forecasting, and management, as it eliminates redundancy and avoids possible inconsistency in data stores. When the concept of data warehousing was developed, it was expected that data warehouses would change the information systems of companies at a great pace; unfortunately, this has not been the reality. A major reason for data warehouse project failures is poor maintenance. Without proper maintenance it is almost impossible to get the desired results from a data warehouse. Compared with operational systems, data warehouses need far more maintenance, and a support team of qualified professionals is needed to take care of the issues that arise after deployment, including data extraction, data loading, network management, training and communication, query management, and other related tasks. The three-tier architecture provides warehousing systems with generality, extensibility, efficiency, scalability, and intelligence. The robust quality of its data has already ensured a bright future for data warehousing, not only in business but also in very large application areas such as weather systems and highway management systems [16].

References
1. Mark Peco, TDWI Data Warehousing Concepts and Principles: An Introduction to the Field of Data Warehousing, TDWI World Conference (The Data Warehousing Institute: Renton, WA, 2004).
2. William H. Inmon and Richard D. Hackathorn, Using the Data Warehouse (New York: John Wiley & Sons, 1994).
3. William H. Inmon, Claudia Imhoff, and Ryan Sousa, Corporate Information Factory (New York: John Wiley & Sons, 1998).
4. Inmon and Hackathorn, Using the Data Warehouse.
5. Inmon, Imhoff, and Sousa, Corporate Information Factory.
6. Inmon and Hackathorn, Using the Data Warehouse.
7. William H. Inmon, Building the Data Warehouse, 2nd ed. (New York: John Wiley & Sons, 1996).
8. Peco, TDWI Data Warehousing Concepts and Principles.
9. Inmon and Hackathorn, Using the Data Warehouse.
10. Inmon, Imhoff, and Sousa, Corporate Information Factory.
11. Peco, TDWI Data Warehousing Concepts and Principles.
12. Inmon and Hackathorn, Using the Data Warehouse.
13. Jixue Liu and Millist Vincent, An Architecture for Data Warehouse Systems.
14. Rifaie, Kianmehr, Alhajj, Ridley, Data Warehouse Architecture and Design (IEEE IRI 2008, July 13-15, 2008, Las Vegas, Nevada, USA).
15. Reddy, A. Lavanya, Dr. V. Khanna, Reddy, Research Issues on Data Warehouse Maintenance (International Conference on Advanced Computer Control, 2008).
16. Qian, Li-jun, The Architecture and Design Strategy for Data Warehouse of Highway Management (2009 Second International Conference on Intelligent Computation Technology and Automation).
