Approved by AICTE Plot No. 7, Phase-II, Institutional Area, Behind the Grand Hotel, Vasant Kunj, New Delhi 110070 Website: www.srisiim.org
Project Report on Management Information System (208) On Data Warehousing and Data Mining
Submitted By: Vikram Singh Tomar (160) Udit Kumar (155) Vijay Krishna (158) (PGDM 2013-2015)
Declaration
We hereby declare that this project report on Data Warehousing and Data Mining is our own authentic work. All the work involved in completing it, including research and the analysis of an organization's activities, was carried out honestly by us.
ACKNOWLEDGEMENT
We would like to express our heartfelt gratitude to our faculty guide, Prof. N Venkatesan, for giving us the opportunity to prepare a project report on Data Warehousing and Data Mining, and for his valuable guidance and sincere cooperation, which helped us complete this project.
Vikram Singh Tomar Udit Kumar Vijay Krishna PGDM Batch (2013-2015) SRI SIIM
INDEX
1. Abstract
2. Data Warehousing
   - Introduction
   - Need of Data Warehousing
   - Purpose of Data Warehousing
   - Characteristics
   - Life cycle
   - Components of a data warehouse
   - Define Online Analytical Processing
   - Tools and technologies
   - Applications
3. Understand Data Marts
   - Introduction
   - Implementation of a Data Mart
   - Maintenance of a Data Mart
   - Development approaches in a Data Mart
5. Data Mining
   - Introduction
   - Types of Data Mining
   - Major elements of Data Mining
   - Data Mining: A KDD process
   - Steps in KDD process
   - Methods of Data Mining
6. Conclusion
7. Bibliography
The following figure shows an OLTP database being accessed directly by both an OLTP transaction application and an analysis process.
The following figure shows analysis users accessing an offline replicated database.
The following figure shows the creation of smaller data warehouses, or data marts, that an analysis user uses to make a decision.
A multiple data mart architecture leads to the creation of an Enterprise Data Warehouse (EDW), which accumulates data from more than one OLTP system and provides consolidated, clean data for creating any kind of data mart.
The following figure shows an EDW that has been created from an OLTP database and that in turn feeds clean, consolidated data marts, each built for a specific objective.
The following figure shows the schematic representation of the functional parts of an OLTP system and data warehouse.
[Figure: an RDBMS handling updates and queries]
Applications:
- Online Transaction Processing (OLTP): OLTP systems are the major kind of enterprise application. Examples: order entry systems, inventory control systems, reservation systems, point-of-sale systems, tracking systems, etc.
- Executive Information System (EIS): Presents information at the highest level of summarization using corporate business measures. These systems are designed for extreme ease of use and, in many cases, only a mouse is required. Graphics are usually incorporated generously to provide at-a-glance indications of performance.
- Decision Support Systems (DSS): Ideally present information in graphical and tabular form, giving the user the ability to drill down on selected information. Note the increased detail and data manipulation options presented.
Data analysis and arrangement in a data warehouse is done with the help of:
- Metadata: information about the data in the data warehouse, maintained by the OLAP server.
- OLAP systems: used to arrange and analyze data in a data warehouse.
- Extraction and cleaning tools: used to extract, clean or scrub, and store data in the data warehouse in a homogeneous form after it has been collected from various heterogeneous sources.
The preceding tasks are the responsibilities of the following three roles in a data warehouse: Load Manager, Warehouse Manager, and Query Manager.
The detailed tasks of a load manager are:
- Extracting data from disparate sources.
- Fast-loading extracted data into a temporary database.
- Performing simple data transformations.

The various tools used by a load manager for extracting and loading the data are:
- Fast loader: used for fast loading of data from the operational database to the temporary database.
- Copy management tool: used for simple transformations.
- Stored procedures: used for checking and cleaning data.
- Shell scripts: used for automating the processes and scheduling job control for unattended execution.
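The load manager tasks listed above can be illustrated with a minimal sketch. It assumes an in-memory SQLite database stands in for the temporary database; the table names (`staging_orders`, `clean_orders`) and the sample rows are hypothetical and are not taken from the report.

```python
import sqlite3

# Hypothetical operational rows; in practice these would be extracted from OLTP sources.
SOURCE_ROWS = [
    ("1001", " north ", "250.00"),
    ("1002", "South", "99.50"),
    ("1003", "NORTH", "10.25"),
]

def load_staging(conn, rows):
    """Bulk-load the extracted rows, unchanged, into a temporary staging table."""
    conn.execute(
        "CREATE TABLE staging_orders (order_id TEXT, region TEXT, amount TEXT)"
    )
    conn.executemany("INSERT INTO staging_orders VALUES (?, ?, ?)", rows)

def transform(conn):
    """Simple transformations: trim and upper-case text, cast strings to numbers."""
    conn.execute(
        "CREATE TABLE clean_orders AS "
        "SELECT CAST(order_id AS INTEGER) AS order_id, "
        "       UPPER(TRIM(region)) AS region, "
        "       CAST(amount AS REAL) AS amount "
        "FROM staging_orders"
    )

conn = sqlite3.connect(":memory:")
load_staging(conn, SOURCE_ROWS)
transform(conn)
clean_rows = conn.execute(
    "SELECT order_id, region, amount FROM clean_orders ORDER BY order_id"
).fetchall()
print(clean_rows)
```

Keeping the raw load and the transformation as separate steps mirrors the split between the fast loader and the copy management tool described above.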
The detailed tasks of a warehouse manager are:
- Analyzing data for consistency and referential integrity.
- Creating indexes, views, and partitions of the base data.
- Generating new aggregations.
- Updating existing aggregations.
- Creating back-up data.
- Archiving obsolete data.

The tools used by a warehouse manager are:
- Stored procedures that create indexes, generate and update aggregations, and build multidimensional schemas.
- System management tools for backing up and archiving data.
- Data warehouse-specific tools for query-specific analysis.

The detailed tasks performed by a query manager are:
- Directing queries to the appropriate tables.
- Scheduling the execution of user queries.
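As a rough illustration of three of the warehouse manager tasks above (a referential integrity check, index creation, and generating an aggregation), here is a small sketch using SQLite. The schema (`dim_region`, `fact_sales`, `agg_sales_by_region`) and the data are assumptions made for the example only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_region (region_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales (sale_id INTEGER PRIMARY KEY, region_id INTEGER, amount REAL);
INSERT INTO dim_region VALUES (1, 'NORTH'), (2, 'SOUTH');
INSERT INTO fact_sales VALUES (1, 1, 100.0), (2, 1, 50.0), (3, 2, 75.0), (4, 9, 20.0);
""")

# Referential integrity check: fact rows whose region is missing from the dimension.
orphans = conn.execute(
    "SELECT f.sale_id FROM fact_sales f "
    "LEFT JOIN dim_region d ON d.region_id = f.region_id "
    "WHERE d.region_id IS NULL"
).fetchall()

# Index on the base data to speed up joins and filters on region.
conn.execute("CREATE INDEX idx_sales_region ON fact_sales (region_id)")

# Generate a new aggregation: total sales per region (orphan rows are excluded).
conn.execute(
    "CREATE TABLE agg_sales_by_region AS "
    "SELECT d.name AS region, SUM(f.amount) AS total "
    "FROM fact_sales f JOIN dim_region d ON d.region_id = f.region_id "
    "GROUP BY d.name"
)
totals = conn.execute(
    "SELECT region, total FROM agg_sales_by_region ORDER BY region"
).fetchall()
print(orphans, totals)
```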
[Figure: query manager, showing query redirection and query scheduling through stored procedures and a query management tool, supported by metadata]
The tools used by a query manager are:
- User access tools or stored procedures for directing queries to the appropriate tables.
- Stored procedures, user access tools, third-party software, or database facilities to schedule the execution of queries.
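The two query manager tasks, redirection and scheduling, can be sketched in a few lines. This is only an illustration under stated assumptions: the `AGGREGATES` metadata mapping, the table names, and the priority scheme are hypothetical and do not describe any particular product.

```python
import heapq

# Hypothetical metadata: maps (measure, grouping) to a pre-built aggregate table.
AGGREGATES = {("amount", "region"): "agg_sales_by_region"}

def redirect(measure, group_by, base_table="fact_sales"):
    """Pick the table that should answer a query: an aggregate if one exists."""
    return AGGREGATES.get((measure, group_by), base_table)

class QueryScheduler:
    """Orders queued queries by priority (lower first), FIFO within a priority."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker that preserves submission order

    def submit(self, priority, name):
        heapq.heappush(self._heap, (priority, self._counter, name))
        self._counter += 1

    def run_order(self):
        order = []
        while self._heap:
            order.append(heapq.heappop(self._heap)[2])
        return order

scheduler = QueryScheduler()
scheduler.submit(2, "monthly report")
scheduler.submit(1, "dashboard")
scheduler.submit(2, "ad hoc")
execution_order = scheduler.run_order()
print(redirect("amount", "region"), execution_order)
```

Answering from an aggregate table when the metadata lists one is what lets the query manager avoid scanning the full base data.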
In an enterprise data warehouse, there can be a collection of smaller data warehouses known as data marts.
A data mart:
- Is a specific subset of a data warehouse, stored within its own database.
- Contains the data required at a department level or for a specific business area of the organization.
- Makes query processing faster because it holds a smaller volume of data than a typical data warehouse.
- Enables mobility of data because of its reduced size.
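A minimal sketch of the idea that a data mart is a departmental subset of the warehouse held in its own database. It assumes SQLite, with a second in-memory database attached under the hypothetical name `mart`; the `warehouse_sales` table and its rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A separate database for the data mart, attached alongside the warehouse.
conn.execute("ATTACH DATABASE ':memory:' AS mart")

conn.executescript("""
CREATE TABLE main.warehouse_sales (sale_id INTEGER, department TEXT, amount REAL);
INSERT INTO main.warehouse_sales VALUES
    (1, 'MARKETING', 100.0),
    (2, 'FINANCE', 40.0),
    (3, 'MARKETING', 60.0);

-- The mart keeps only the rows and columns one department needs.
CREATE TABLE mart.marketing_sales AS
    SELECT sale_id, amount
    FROM main.warehouse_sales
    WHERE department = 'MARKETING';
""")

mart_rows = conn.execute(
    "SELECT sale_id, amount FROM mart.marketing_sales ORDER BY sale_id"
).fetchall()
print(mart_rows)
```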
The following figure shows the creation of several data marts from an enterprise data warehouse.
Top down approach: In the top down approach, the data warehouse is created first and the dependent data marts are created after that, as shown in the following figure.
[Figure: top-down approach, with the data warehouse feeding the data marts]
Bottom up approach: In the bottom up approach, the data marts are created first, and these data marts together contribute to the development of the data warehouse, as shown in the following figure.
[Figure: bottom-up approach, with the data marts feeding the data warehouse]
Hybrid approach: This approach is fast and highly user-oriented, like the bottom up approach, and maintains the data integrity of a data warehouse, like the top down approach. The following figure shows the hybrid approach of creating a data mart.
[Figure: hybrid approach, combining data marts and a data warehouse]
Describing OLAP
OLAP is a crucial element of an enterprise data warehouse or data mart solution. It fits into data warehousing and data mart strategies to deliver a compelling way of performing data reporting, scrutiny, analysis, modeling, and planning in an enterprise. OLAP is a process of analyzing and processing data from various data sources, such as a data warehouse.
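To make the idea of multidimensional analysis concrete, the following sketch rolls a small set of fact rows up along chosen dimensions. The dimension names, the `roll_up` helper, and the figures are all assumptions for illustration; real OLAP servers work on far larger cubes.

```python
from collections import defaultdict

# Toy fact rows: (year, region, product, sales). Values are illustrative only.
FACTS = [
    (2013, "NORTH", "A", 100),
    (2013, "SOUTH", "A", 80),
    (2014, "NORTH", "A", 120),
    (2014, "NORTH", "B", 30),
    (2014, "SOUTH", "B", 50),
]
DIMS = ("year", "region", "product")

def roll_up(facts, keep):
    """Sum sales over every dimension that is not listed in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    cube = defaultdict(int)
    for row in facts:
        cube[tuple(row[i] for i in idx)] += row[3]
    return dict(cube)

by_year_region = roll_up(FACTS, ("year", "region"))  # a two-dimensional view
by_year = roll_up(FACTS, ("year",))                  # rolled up further, by time only
print(by_year_region)
print(by_year)
```

The same fact rows answer both questions; only the set of retained dimensions changes, which is the essence of viewing data from different angles.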
The following table lists some basic differences between OLTP and OLAP systems.
DATA MINING
What is data mining?
Data mining refers to the process of analyzing data from different perspectives and summarizing it into useful information. Data mining software is one of a number of tools used for analyzing data from many different dimensions or angles, categorizing it, and summarizing the relationships identified.

Definition: Data mining is the process of finding correlations or patterns among fields in large relational databases. It is the process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions.

Different types of data mining: business, scientific, and Internet data mining.

Five major elements of data mining:
1. Extract, transform, and load transaction data onto the data warehouse system.
2. Store and manage the data in a multidimensional database system.
3. Provide access to business analysts and IT professionals.
4. Analyze the data with application software.
5. Present the data in a useful format, such as a graph or table.
Steps of KDD Process
1. Learning the application domain: relevant prior knowledge and goals of the application.
2. Creating a target data set: data selection.
3. Data cleaning and preprocessing.
4. Data reduction and transformation: finding useful features, dimensionality or variable reduction, and invariant representation.
5. Choosing functions of data mining: summarization, classification, regression, association, clustering.
6. Choosing the mining algorithm(s).
7. Data mining: search for patterns of interest.
8. Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
9. Use of discovered knowledge.

Methods of Data Mining:
1. Classification
2. Regression
3. Clustering
4. Association rules
5. Visualization
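As one concrete example of the methods listed above, association rules are commonly scored by support and confidence. The sketch below computes both on a toy set of transactions; the basket contents and function names are made up for illustration and are not drawn from the report.

```python
# Toy market-basket transactions (illustrative data only).
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated probability of `consequent` given `antecedent`."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

rule_support = support({"bread", "milk"}, TRANSACTIONS)
rule_confidence = confidence({"bread"}, {"milk"}, TRANSACTIONS)
print(rule_support, rule_confidence)
```

Here the rule "bread implies milk" holds in two of the three baskets that contain bread, so a pattern-evaluation step would compare such scores against chosen thresholds before presenting the rule as knowledge.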
Summary
A data warehouse is a large repository of data that supports the decision-making process of an enterprise.

The four characteristics intrinsic to a data warehouse are:
- Consolidated and consistent data
- Subject-oriented data
- Historical data
- Non-volatile data

A data warehouse also uses metadata, a specific type of data that contains information about the data itself.

The various components of a data warehouse are:
- Data sources
- Data preparation area
- Presentation services
- Data marts

A data warehouse primarily performs the following five tasks:
- Data extraction
- Data cleaning
- Data loading
- Querying
- Backup and recovery

The following roles perform the above tasks in a data warehouse:
- Load Manager
- Warehouse Manager
- Query Manager
The four development approaches in creating a data mart are:
- Top down approach
- Bottom up approach
- Hybrid approach
- Federated approach

The features of OLAP are:
- Multidimensional views
- Calculation-intensive capabilities
- Time intelligence
CONCLUSION
Data warehousing provides the means to turn raw data into information for making effective business decisions; the emphasis is on information, not data. The data warehouse is the hub for decision support data. Data mining is a useful tool with multiple algorithms that can be tuned for specific tasks. It can benefit business, medicine, and science, but it needs more efficient algorithms to speed up the mining process.