
Sri Sharada Institute Of Indian Management -Research

Approved by AICTE Plot No. 7, Phase-II, Institutional Area, Behind the Grand Hotel, Vasant Kunj, New Delhi 110070 Website: www.srisiim.org

Project Report on Management Information System (208) On Data Warehousing and Data Mining

Submitted To: Prof. N Venkatesan

Submitted By: Vikram Singh Tomar (160) Udit Kumar (155) Vijay Krishna (158) (PGDM 2013-2015)

Declaration

We hereby declare that this project report on Data Warehousing and Data Mining is our own authentic work. All the work involved in completing it, such as research and analysis of an organization's activities, was carried out honestly by us.

Place: New Delhi

Vikram Singh Tomar Udit Kumar Vijay Krishna (PGDM 2013-2015)

ACKNOWLEDGEMENT

We would like to express our heartfelt gratitude to our faculty guide, Prof. N Venkatesan, for giving us the opportunity to prepare a project report on Data Warehousing and Data Mining, and for his valuable guidance and sincere cooperation, which helped us complete this project.

Vikram Singh Tomar Udit Kumar Vijay Krishna PGDM Batch (2013-2015) SRI SIIM

INDEX
1. ABSTRACT
2. DATA WAREHOUSING
   Introduction
   Need of Data Warehousing
   Purpose of Data Warehousing
   Characteristics
   Life cycle
   Components of a data warehouse
   Define Online Analytical Processing
   Tools and technologies
   Applications
3. Understand Data Marts
   Introduction
   Implementation of a Data Mart
   Maintenance of a Data Mart
   Development approaches in a Data Mart
4. Describing OLAP
   Introduction
   The benefits of OLAP
   The features of OLAP
5. Data Mining
   Introduction
   Types of Data Mining
   Major elements of Data Mining
   Data Mining: A KDD process
   Steps in KDD process
   Methods of Data Mining
6. Conclusion
7. Bibliography

DATA WAREHOUSING AND DATA MINING

ABSTRACT:


Fast, accurate, and scalable data analysis techniques are needed to extract useful information from huge volumes of data. A data warehouse is a single, integrated source of decision support information formed by collecting data from multiple sources, internal to the organization as well as external, and transforming and summarizing this information to enable improved decision making. A data warehouse is designed to give users easy access to large amounts of information, and data access is typically supported by specialized analytical tools and applications. Typical applications include decision support systems and executive information systems.

Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data. It is an information extraction activity whose goal is to discover hidden facts contained in databases: the process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions. The project entitled Website Data Mining is an application of data mining built to help website developers create websites effectively. Data mining finds patterns and subtle relationships in data and infers rules that allow the prediction of future results. It produces output values for an assigned set of input values. Typical applications include market segmentation, customer profiling, fraud detection, evaluation of retail promotions, and credit risk analysis.

Data Warehousing - An Overview


Organizations are increasingly analyzing current and historical data to identify useful patterns and support business strategies. A large amount of the right information is the key to survival in today's competitive environment, and this kind of information can be made available only if there is a fully integrated enterprise data warehouse.

What is data warehousing?


According to W.H. Inmon, "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." A data warehouse can be defined as a large central repository of data that helps in the decision-making process of an enterprise. It comprises integrated databases, which can be any DBMS, text files, or flat files. A data warehouse is one of the key components of a BI system.

Need for data warehousing:


Before business analytical tools such as OLAP were available, organizations handled decision support by accessing data directly from OLTP systems for both transaction and analysis purposes. Large organizations generally use the following three kinds of processes to extract data from OLTP systems:
- Access the OLTP database directly for all types of transactions and analysis.
- Create an offline replicated database from the OLTP database at a predefined regular interval. While the source OLTP system is used for managing daily transactional activities, the replicated database is used for analysis purposes only.
- Create small data warehouses that satisfy the individual needs of business users from the OLTP systems, where all past transactions are also stored.

Purpose of Data Warehousing:


- Better business intelligence for end users.
- Reduction in time to access and analyze information.
- Consolidation of disparate information sources.
- Replacement of older, less-responsive decision support systems.
- Faster time to market for products and services.

The following figure shows access of an OLTP database directly by an OLTP transactions application and analysis process together.

The following figure shows the process of accessing offline replicated database by analysis users.

The following figure shows the creation of smaller data warehouses or data marts that an application analysis user uses to make a decision.

A multiple data mart architecture leads to the creation of an Enterprise Data Warehouse (EDW) that accumulates data from more than one OLTP system and provides consolidated, clean data for creating any kind of data mart.

The following figure shows an EDW that has been created from OLTP databases and that in turn feeds clean, consolidated, objective-specific data marts.

The following figure shows the schematic representation of the functional parts of an OLTP system and data warehouse.

Characteristics intrinsic to a data warehouse are:


- Consolidated and consistent data
- Subject-oriented data
- Historical data
- Non-volatile data

The following figure shows the behavior of data in an RDBMS and in a data warehouse.

[Figure: an RDBMS receives updates and queries; a data warehouse receives data loads and queries.]

DATA WAREHOUSE LIFE CYCLE :


Data warehousing is a concept, not a product that can be purchased off the shelf. It is a set of hardware and software components integrated together so that the massive amount of stored data can be analyzed efficiently. It is also a process through which one can build a successful data warehouse. The following are the five steps towards building a successful data warehouse:
1) JUSTIFICATION
2) REQUIREMENT ANALYSIS
3) DESIGN
4) DEVELOPMENT & IMPLEMENTATION
5) DEPLOYMENT

Tools and Technologies:


The critical steps in the construction of a data warehouse are:
- Extraction
- Cleansing
- Transformation

After the critical steps, loading the results into the target system can be carried out either by separate products or by a single product. The tools fall into the following categories:
- Code generators
- Database data replication tools
- Dynamic transformation engines
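The extraction, cleansing, and transformation steps named in this section, followed by a load into the target system, can be sketched in a few lines. This is only an illustration under stated assumptions: the sample CSV export, the `fact_orders` table, and all function names are hypothetical and are not taken from the report, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from an OLTP order-entry system (illustrative data only).
RAW_ORDERS = """order_id,customer,amount,order_date
1, alice ,120.50,2014-01-05
2,BOB,,2014-01-06
3,carol,75.00,2014-01-06
"""


def extract(text):
    """Extraction: read rows from the source export."""
    return list(csv.DictReader(io.StringIO(text)))


def cleanse(rows):
    """Cleansing: drop rows with a missing amount and tidy customer names."""
    cleaned = []
    for row in rows:
        if not row["amount"].strip():
            continue  # incomplete record, not loaded
        fixed = dict(row)
        fixed["customer"] = fixed["customer"].strip().title()
        cleaned.append(fixed)
    return cleaned


def transform(rows):
    """Transformation: convert text fields into the types the target table expects."""
    return [
        (int(r["order_id"]), r["customer"], float(r["amount"]), r["order_date"])
        for r in rows
    ]


def load(conn, records):
    """Loading: write the prepared records into the (assumed) warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders "
        "(order_id INTEGER, customer TEXT, amount REAL, order_date TEXT)"
    )
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?, ?)", records)
    conn.commit()


def run_etl(conn):
    """Run all steps and return (row count, total amount) from the target table."""
    load(conn, transform(cleanse(extract(RAW_ORDERS))))
    return conn.execute("SELECT COUNT(*), SUM(amount) FROM fact_orders").fetchone()
```

Keeping each step in its own function mirrors the way the steps can be handled either by separate products or by a single one.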

Applications:
Online Transaction Processing (OLTP): OLTP systems are a major kind of enterprise application. Examples: order entry systems, inventory control systems, reservation systems, point-of-sale systems, tracking systems, etc.

Executive Information Systems (EIS): These present information at the highest level of summarization using corporate business measures. They are designed for extreme ease of use and, in many cases, only a mouse is required. Graphics are usually generously incorporated to provide at-a-glance indications of performance.

Decision Support Systems (DSS): These ideally present information in graphical and tabular form, giving the user the ability to drill down on selected information. Compared with an EIS, a DSS presents more detail and more data manipulation options.


Data analysis and arrangement in a data warehouse are done with the help of:

Metadata: Metadata is the information about the data in the data warehouse, which is maintained by the OLAP server.

OLAP systems:
- Are used to arrange and analyze data in a data warehouse.
- Are used to extract, clean or scrub, and store data in the data warehouse in a homogeneous form after it has been collected from various heterogeneous sources.

Components of a data warehouse are:


Data sources: These are the various source systems, such as OLTP systems and legacy systems, that manage the daily transactional data of a business organization and supply this data to the data warehouse.

Data staging area: The data staging area, also known as the data preparation area, is a collection of processes that extract data from various sources and then clean, transform, and load the data into a data warehouse.

Presentation services: Various presentation services, such as summary reports, are provided by a data warehouse to help decision-makers explore the information.

Data marts: These are subsets of a data warehouse that store the data specific to a particular business activity.


The following figure shows the various components of a data warehouse.

Roles and Responsibilities in a Data Warehouse:


A data warehouse primarily performs the following five tasks:
- Data extraction
- Data cleaning
- Data loading
- Querying
- Backup and recovery

The following figure shows the data warehousing process, detailing the preceding tasks.


The preceding tasks are the responsibilities of the following three roles in a data warehouse:
- Load Manager
- Warehouse Manager
- Query Manager

The detailed tasks of a load manager are:
- Extracting data from disparate sources
- Fast-loading extracted data into a temporary database
- Performing simple data transformations

The following figure shows the role structure of a load manager.

The various tools used by a load manager for extracting and loading the data are:
- Fast loader: Used for fast loading of data from the operational database to the temporary database.
- Copy management tool: Used for simple transformations.
- Stored procedures: Used for checking and cleaning data.
- Shell scripts: Used for automating the processes and scheduling job control for unattended execution.


The detailed tasks of a warehouse manager are:
- Analyzing data for consistency and referential integrity checks.
- Creating indexes, views, and partitions of the base data.
- Generating new aggregations.
- Updating existing aggregations.
- Creating backup data.
- Archiving obsolete data.
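Two of the warehouse manager tasks, generating an aggregation and creating an index, can be illustrated with a small sketch. It assumes an in-memory SQLite database in place of a real warehouse; the `sales_detail` and `sales_daily_summary` tables, their columns, and the sample rows are hypothetical.

```python
import sqlite3


def build_daily_sales_summary(conn):
    """Rebuild a summary (aggregation) table from the detailed base data."""
    conn.execute("DROP TABLE IF EXISTS sales_daily_summary")
    conn.execute(
        """
        CREATE TABLE sales_daily_summary AS
        SELECT sale_date,
               region,
               SUM(amount) AS total_amount,
               COUNT(*)    AS sale_count
        FROM sales_detail
        GROUP BY sale_date, region
        """
    )
    # Index the summary so that date-based queries avoid a full scan.
    conn.execute("CREATE INDEX idx_summary_date ON sales_daily_summary (sale_date)")
    conn.commit()


# Hypothetical base data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_detail (sale_date TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales_detail VALUES (?, ?, ?)",
    [
        ("2014-01-05", "North", 100.0),
        ("2014-01-05", "North", 50.0),
        ("2014-01-05", "South", 30.0),
        ("2014-01-06", "North", 20.0),
    ],
)
build_daily_sales_summary(conn)
summary = conn.execute(
    "SELECT sale_date, region, total_amount, sale_count "
    "FROM sales_daily_summary ORDER BY sale_date, region"
).fetchall()
```

Calling `build_daily_sales_summary` again after new data loads corresponds to updating an existing aggregation; a production system would more likely refresh it incrementally.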

The following figure shows the role structure of a warehouse manager.

The tools used by a warehouse manager are:
- Stored procedures that create indexes, generate and update aggregations, and build multidimensional schemas.
- System management tools for backing up and archiving data.
- Data warehouse-specific tools for query-specific analysis.

The detailed tasks performed by a query manager are:
- Directing queries to the appropriate tables.
- Scheduling the execution of user queries.


The following figure shows the role structure of a query manager.

[Figure: role structure of a query manager. Labels: Query; Query Manager; Query Redirection; Stored Procedures; Query Management Tool; Query Scheduling; Meta Data; Detailed Information; Summary Information.]

The tools used by a query manager are:
- User access tools or stored procedures for directing queries to the appropriate tables.
- Stored procedures, user access tools, third-party software, or database facilities to schedule the execution of queries.
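The two query manager tasks, directing a query to the appropriate table and scheduling its execution, can be sketched as follows. This is a simplified illustration, not how any particular product works: the table names, the column set, and the rule "use the summary table when it has every requested column" are assumptions made for the example.

```python
import heapq
import itertools

# Hypothetical metadata: the columns available in the summary table.
SUMMARY_COLUMNS = {"sale_date", "region", "total_amount"}


def route_query(requested_columns):
    """Query redirection: use the summary table when it can answer the query."""
    if set(requested_columns) <= SUMMARY_COLUMNS:
        return "sales_daily_summary"
    return "sales_detail"


class QueryScheduler:
    """Query scheduling: run queries in priority order (lower number first)."""

    def __init__(self):
        self._queue = []
        self._counter = itertools.count()  # keeps submission order for equal priorities

    def submit(self, priority, requested_columns):
        table = route_query(requested_columns)
        entry = (priority, next(self._counter), table, tuple(requested_columns))
        heapq.heappush(self._queue, entry)

    def run_next(self):
        """Return (table, columns) for the next query to execute."""
        _, _, table, columns = heapq.heappop(self._queue)
        return table, columns


scheduler = QueryScheduler()
scheduler.submit(2, ["sale_date", "customer_id"])  # needs detailed information
scheduler.submit(1, ["region", "total_amount"])    # summary information is enough
first = scheduler.run_next()
second = scheduler.run_next()
```

Here the higher-priority query is answered from summary information, while the second one falls back to the detailed table because `customer_id` is not in the summary.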


Understanding Data Marts

In an enterprise data warehouse, there can be a collection of smaller data warehouses known as data marts.

A data mart:
- Is a specific subset of a data warehouse, stored within its own database.
- Contains the data required at a department level or for a specific business area of the organization.
- Makes query processing faster because it holds a smaller volume of data than a typical data warehouse.
- Enables mobility of data because of its reduced size.
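The idea of a data mart as a department-level subset can be shown with a very small sketch. The rows, the `department` field, and the function name are hypothetical; a real data mart would live in its own database rather than in a Python list.

```python
# Hypothetical warehouse rows, for illustration only.
WAREHOUSE_SALES = [
    {"department": "Sales", "region": "North", "amount": 120.0},
    {"department": "Finance", "region": "North", "amount": 40.0},
    {"department": "Sales", "region": "South", "amount": 80.0},
]


def build_data_mart(warehouse_rows, department, columns):
    """Copy only one department's rows, and only the columns it needs."""
    return [
        {column: row[column] for column in columns}
        for row in warehouse_rows
        if row["department"] == department
    ]


sales_mart = build_data_mart(WAREHOUSE_SALES, "Sales", ["region", "amount"])
```

The resulting `sales_mart` holds fewer rows and fewer columns than the warehouse, which is the reason queries against it can run faster.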

The following figure shows the creation of several data marts from an enterprise data warehouse.


Implementation of a Data Mart:


In an organization, a data mart is generally implemented by the enterprise's Information Technology department, by a vendor, or by both working together. Combining internal expertise with the vendor's helpdesk can be the best and most cost-effective solution, and the organization's own employees can more easily translate its vision into technology.

Maintenance of a Data Mart:


Maintenance of a data mart:
- Needs periodic effort to load, refresh or update, and delete data in the data mart.
- Has to be done on a regular cycle, at the frequency predefined for the department's specific data mart.

Development approaches in a Data Mart are:

Top down approach: In the top down approach, the data warehouse is created first and the dependent data marts are created after that, as shown in the following figure.
[Figure: top-down approach, showing the data warehouse and the data marts.]


Bottom up approach: In the bottom up approach, the data marts are created first, and these data marts together contribute to the development of the data warehouse, as shown in the following figure.
[Figure: bottom-up approach, showing the data marts and the data warehouse.]

Hybrid approach: This approach is fast and highly user-oriented, like the bottom up approach, and maintains the data integrity of a data warehouse, like the top down approach. The following figure shows the hybrid approach of creating a data mart.
[Figure: hybrid approach, showing the data marts and the data warehouse.]


Federated approach: This approach recommends ways to collect large amounts of heterogeneous data from the other data warehouses, data marts, and packaged applications that already exist inside companies. The goal of a federated approach is to integrate existing analytic structures wherever and however possible.


Describing OLAP

OLAP is a crucial element of an enterprise data warehouse or data mart solution. It fits into data warehousing and data mart strategies to deliver an exceptional and convincing way of performing data reporting, scrutiny, analysis, modeling, and planning in an enterprise. OLAP is a process of analyzing and processing data from various data sources, such as a data warehouse.

The benefits of OLAP are:


- OLAP enables enterprises to respond to market demands more efficiently.
- Developers using software specially designed for OLAP solutions are able to deliver applications to end users faster and provide better service to them.
- OLAP systems improve the performance of OLTP systems by reducing network traffic and eliminating complex queries from the OLTP database.

The features of OLAP are:


Multidimensional views: OLAP enables business analysts to analyze and store data in multidimensional structures. The multidimensional data views are referred to as cubes.

Calculation-intensive capabilities: OLAP applications can perform complex calculations and aggregations on the stored data, such as percentage of totals, calculation of profits, and so on. These complex calculations and aggregations help in reaching the ultimate business solutions.

Time intelligence: All OLAP applications use the time dimension. This is the most important and widely used parameter for performing business analysis. The time parameter is used to compare and judge the performance of a business process.
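The three features can be illustrated on a tiny in-memory cube. This is a sketch only: the sales figures, the dimension names (`year`, `region`, `product`), and the helper functions are hypothetical, and a real OLAP server stores and aggregates cubes very differently.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, region, product, amount).
SALES = [
    (2013, "North", "Laptop", 100.0),
    (2013, "South", "Laptop", 50.0),
    (2013, "North", "Phone", 50.0),
    (2014, "North", "Laptop", 150.0),
    (2014, "South", "Phone", 150.0),
]
DIMENSIONS = ("year", "region", "product")


def roll_up(facts, dims):
    """Multidimensional view: total the amount over the chosen dimensions."""
    positions = [DIMENSIONS.index(d) for d in dims]
    totals = defaultdict(float)
    for fact in facts:
        key = tuple(fact[p] for p in positions)
        totals[key] += fact[-1]
    return dict(totals)


def percent_of_total(totals):
    """Calculation-intensive capability: each cell as a percentage of the total."""
    grand_total = sum(totals.values())
    return {key: 100.0 * value / grand_total for key, value in totals.items()}


by_region = roll_up(SALES, ["region"])
region_share = percent_of_total(by_region)

# Time intelligence: compare performance across the time dimension.
by_year = roll_up(SALES, ["year"])
growth = by_year[(2014,)] - by_year[(2013,)]
```

Passing more than one dimension, for example `roll_up(SALES, ["year", "region"])`, gives a finer-grained view of the same cube.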


The following table lists some basic differences between OLTP and OLAP systems.


DATA MINING
What is data mining?

Data mining refers to the process of analyzing data from different perspectives and summarizing it into useful information. Data mining software is one of a number of tools used to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified.

Definition: Data mining is the process of finding correlations or patterns among fields in large relational databases. It is the process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions.

Different Types of Data Mining: Business, Scientific, and Internet Data Mining

Five major elements of Data Mining:
1. Extract, transform, and load transaction data onto the data warehouse system.
2. Store and manage the data in a multidimensional database system.
3. Provide access to business analysts and IT professionals.
4. Analyze the data with application software.
5. Present the data in a useful format, such as a graph or table.


DATA MINING: A KDD Process

Steps of KDD Process
1. Learning the application domain: relevant prior knowledge and goals of the application.
2. Creating a target data set: data selection.
3. Data cleaning and preprocessing.
4. Data reduction and transformation: find useful features, dimensionality or variable reduction, and invariant representation.
5. Choosing functions of data mining: summarization, classification, regression, association, clustering.
6. Choosing the mining algorithm(s).
7. Data mining: search for patterns of interest.
8. Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
9. Use of discovered knowledge.

Methods of Data Mining:
1. Classification
2. Regression
3. Clustering
4. Association rules
5. Visualization
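As one small example of these methods, association rules are commonly evaluated with two standard measures, support and confidence. The sketch below computes both for a rule such as "bread implies milk"; the transactions and function names are hypothetical and are not drawn from the report.

```python
# Hypothetical market-basket transactions, for illustration only.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    items = set(itemset)
    matches = sum(1 for transaction in transactions if items <= transaction)
    return matches / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


bread_and_milk_support = support({"bread", "milk"}, TRANSACTIONS)
bread_implies_milk = confidence({"bread"}, {"milk"}, TRANSACTIONS)
```

A rule is usually kept only when both measures exceed thresholds chosen by the analyst, which corresponds to the pattern evaluation step of the KDD process.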


Summary

A data warehouse is a large repository of data that helps in the decision-making process of an enterprise.

The four characteristics intrinsic to a data warehouse are:
- Consolidated and consistent data
- Subject-oriented data
- Historical data
- Non-volatile data

A data warehouse also uses metadata, a specific type of data that contains information about the data it holds.

The various components of a data warehouse are:
- Data sources
- Data preparation area
- Presentation services
- Data marts

A data warehouse primarily performs the following five tasks:
- Data extraction
- Data cleaning
- Data loading
- Querying
- Backup and recovery

The following roles perform the above-defined tasks in a data warehouse:
- Load Manager
- Warehouse Manager
- Query Manager


The four development approaches in creating a data mart are:
- Top down approach
- Bottom up approach
- Hybrid approach
- Federated approach

The features of OLAP are:
- Multidimensional views
- Calculation-intensive capabilities
- Time intelligence


CONCLUSION
Data warehousing provides the means to turn raw data into information for making effective business decisions; the emphasis is on information, not data. The data warehouse is the hub for decision support data. Data mining is a useful tool with multiple algorithms that can be tuned for specific tasks. It can benefit business, medicine, and science. More efficient algorithms are still needed to speed up the data mining process.

