
K. E. Society's
Rajarambapu Institute of Technology, Rajaramnagar

QUANTUM 06

A paper on

DATA MINING AND WAREHOUSING

Miss Kulkarni Aditi    Miss Mulla Vahida

(T. E. I.T.)    (T. E. I.T.)

aditi_k10@rediff.com    vahida.mulla@rediffmail.com

WALCHAND COLLEGE OF ENGINEERING,

SANGLI.
Abstract

We live in the Age of Information. The importance of collecting data that reflect business or
scientific activities to achieve competitive advantage is widely recognized nowadays. Powerful systems for
collecting data and managing it in large databases are in place in all large and mid-range companies.
However, the bottleneck in turning this data into success is the difficulty of extracting knowledge about
the system under study. These problems can be addressed if the information hidden among the megabytes
of data in a database can be found explicitly and utilized. Modeling the investigated system and discovering
the relations that connect variables in a database are the subject of data mining.

Modern data mining systems learn by themselves from the previous history of the
investigated system, formulating and testing hypotheses about the rules that this system obeys. The
primary concept behind data warehousing is that the nonvolatile data stored for business analysis can be
most effectively managed by separating it from the active data in the operational systems.

The following presentation covers data mining and data warehousing concepts,
implementations and applications. Starting with introductory concepts, we proceed to a detailed
discussion of data mining and data warehousing. We then turn to the pros and cons of the
subject, followed by some concluding remarks.
INDEX

1. Introduction

What is data warehousing and data mining?

Reasons for the growing popularity of Data Mining

2. Data Warehousing

Data warehousing concepts

Data warehousing implementations

Characteristics of data warehousing

3. Data mining

How does data mining work?

Different Data Mining Technologies and Systems

4. Pros and cons of data warehousing and data mining

5. Applications

6. Conclusion and references


Introduction

In today's competitive global business environment, understanding and managing
enterprise-wide information is crucial for making timely decisions and responding to
changing business conditions. Many companies are realizing a business advantage by
leveraging one of their key assets - business data. There is a tremendous amount of data
generated by day-to-day business operational applications. In addition there is valuable
data available from external sources such as market research organizations, independent
surveys and quality testing labs. Studies indicate that the amount of data in a given
organization doubles every five years. Data Warehousing has emerged as an increasingly
popular and powerful concept of applying information technology to turn these huge
islands of data into meaningful information for better business decisions.

What is data mining?


Generally, data mining (sometimes called data or knowledge discovery) is the process
of analyzing data from different perspectives and summarizing it into useful information:
information that can be used to increase revenue, cut costs, or both. Data mining
software is one of a number of analytical tools for analyzing data. It allows users to
analyze data from many different dimensions or angles, categorize it, and summarize the
relationships identified. Technically, data mining is the process of finding correlations or
patterns among dozens of fields in large relational databases.

What is Data Warehousing?

According to Bill Inmon, known as the father of data warehousing, a data warehouse
is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of
management decisions.

Data warehousing is a concept: a set of hardware and software components that can
be used to better analyze the massive amounts of data that companies accumulate, so that
they can make better business decisions. Data warehousing is not just the data in the data
warehouse, but also the architecture and tools to collect, query, analyze and present information.

The primary concept behind data warehousing is that the nonvolatile data stored for
business analysis can be most effectively managed by separating it from the active data in
the operational systems. Nonvolatile data is data that is not modified or rarely modified
after being moved from operational systems to a data warehouse.

Reasons for the growing popularity of Data Mining

1. Growing Data Volume

2. Limitations of Human Analysis

3. Low Cost of Machine Learning


Data mining

How does data mining work?

While large-scale information technology has been evolving separate transaction and
analytical systems, data mining provides the link between the two. Data mining software
analyzes relationships and patterns in stored transaction data based on open-ended user
queries. Several types of analytical software are available: statistical, machine learning,
and neural networks. Four types of relationships are generally sought:

• Classes: Stored data is used to locate data in predetermined groups.
• Clusters: Data items are grouped according to logical relationships or consumer
preferences.
• Associations: Data can be mined to identify associations, such as items that tend to
occur together in the same transaction.
• Sequential patterns: Data is mined to anticipate behavior patterns and trends.
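
To make the idea of an association concrete, the following is a minimal sketch that measures how often two items occur together (support) and how often one item accompanies another (confidence). The transactions, item names, and function names are hypothetical and chosen only for illustration; real data mining tools use more elaborate algorithms.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each set is one customer's basket.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]


def pair_support(baskets):
    """Return the fraction of baskets that contain each pair of items."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}


def confidence(baskets, antecedent, consequent):
    """Estimate how often `consequent` appears in baskets containing `antecedent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)


support = pair_support(transactions)
print(support[("beer", "diapers")])                 # share of baskets with both items
print(confidence(transactions, "beer", "diapers"))  # share of beer baskets that also hold diapers
```

A pair with high support and high confidence is a candidate association that an analyst would then examine further.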

Data mining consists of five major elements:

• Extract, transform, and load transaction data onto the data warehouse system.

• Store and manage the data in a multidimensional database system.

• Provide data access to business analysts and information technology
professionals.

• Analyze the data by application software.

• Present the data in a useful format, such as a graph or table.
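
The five elements above can be sketched end to end in a few lines. This is only an illustration under stated assumptions: the raw records, the table name `sales`, and the helper functions are hypothetical, and an in-memory SQLite table stands in for the warehouse (a relational store rather than a true multidimensional database).

```python
import sqlite3

# Hypothetical raw records as they might arrive from an operational system.
raw_sales = [
    {"store": " Sangli ", "product": "tea", "amount": "120.50"},
    {"store": "sangli", "product": "Tea", "amount": "79.50"},
    {"store": "Pune", "product": "coffee", "amount": "200"},
]


def transform(record):
    """Clean one record: trim and normalize text, convert the amount to a number."""
    return (
        record["store"].strip().title(),
        record["product"].strip().lower(),
        float(record["amount"]),
    )


def load(records):
    """Load cleaned records into an in-memory table standing in for the warehouse."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (store TEXT, product TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
    conn.commit()
    return conn


def sales_by_store(conn):
    """Analyze: total sales per store, ordered by store name."""
    query = "SELECT store, SUM(amount) FROM sales GROUP BY store ORDER BY store"
    return conn.execute(query).fetchall()


warehouse = load(transform(r) for r in raw_sales)   # extract, transform, load
report = sales_by_store(warehouse)                  # access and analyze
for store, total in report:                         # present as a plain-text table
    print(f"{store:<10}{total:>10.2f}")
```

Note how the transform step merges " Sangli " and "sangli" into one store before loading, so the analysis sees a single consistent value.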

Data Mining Systems

Subject-oriented analytical systems


As an example, consider systems for the analysis of financial markets based on the method of
technical analysis. Technical analysis is a bundle of a few dozen different
techniques for forecasting price dynamics and selecting the optimal structure of an
investment portfolio, based on various empirical models of market behavior.
Such systems offer the user an advantage: they operate in the
specific terms of the application field.
Statistical packages
In statistical packages, traditional statistical methods are supplemented by some
elements of data mining. Their main data analysis methods remain classical in
nature: correlation, regression, and factor analyses and other techniques of that kind.
One of the major drawbacks of such systems is that they cannot be used effectively
by a user who does not have thorough training in statistics.
Another disadvantage of statistical packages is that during data exploration the user
has to repeat the same set of elementary operations again and again.
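
As an illustration of the classical methods named above, the sketch below computes a Pearson correlation coefficient and a least-squares regression line in plain Python. The data values and function names are hypothetical; a statistical package would offer the same calculations with many more diagnostics.

```python
from math import sqrt


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    covariance = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread_x = sum((x - mx) ** 2 for x in xs)
    spread_y = sum((y - my) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


def linear_regression(xs, ys):
    """Least-squares line y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


# Hypothetical data: advertising spend versus units sold.
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
units = [3.0, 5.0, 7.0, 9.0, 11.0]   # constructed so that units = 2 * spend + 1

print(pearson_correlation(spend, units))
print(linear_regression(spend, units))
```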
Neural Networks
In order to make meaningful predictions, a neural network first has to be trained on
data describing previous situations for which both the input parameters and the correct
reactions to them are known. Training consists of selecting the weights ascribed to the
connections between neurons so that the reactions produced by the network are as close
as possible to the known correct reactions.
The drawback is the lack of transparency of the forecasting models represented by a
trained neural network.
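
The training idea can be shown on the smallest possible case: a single threshold neuron whose weights are adjusted with the perceptron learning rule until its reactions match the known correct ones. This is a deliberately simplified sketch, not a full neural network; the training data (the logical AND function) and all names are assumptions made for illustration.

```python
def predict(weights, bias, inputs):
    """Reaction of one threshold neuron: 1 if the weighted sum is positive, else 0."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if activation > 0 else 0


def train_neuron(examples, epochs=20, learning_rate=1.0):
    """Train a single neuron with the perceptron learning rule.

    examples: list of (inputs, target) pairs with target 0 or 1.
    Returns the learned (weights, bias).
    """
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in examples:
            error = target - predict(weights, bias, inputs)
            # Move each weight in the direction that reduces the error.
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# Hypothetical training set: known inputs with their correct reactions (logical AND).
and_examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_neuron(and_examples)
for inputs, target in and_examples:
    print(inputs, "->", predict(weights, bias, inputs), "expected", target)
```

The learned weights are just numbers; nothing in them explains why a given reaction is produced, which is the non-transparency mentioned above.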
Evolutionary Programming
At present this is the youngest and arguably one of the most promising branches of data
mining. The underlying idea of the method is that the system automatically formulates
hypotheses about the dependence of the target variable on other variables in the form of
programs expressed in an internal programming language. Because this internal
language is universal, any dependence or algorithm can in principle be
expressed in it. The production of internal programs
(hypotheses) is organized as evolution in the world of all possible programs.
Decision Trees
In its basic form this method is applied to classification tasks only, which limits
the applicability of decision trees in many fields. Applying the
method to a training set produces a hierarchical structure of classifying rules of the type
"IF...THEN...", which has the form of a tree.
An advantage of this method is that this representation of rules is intuitive
and easily understood by humans.
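
The sketch below learns the simplest possible tree, a single split (often called a decision stump), and prints it as "IF...THEN..." rules. It assumes numeric features and exactly two class labels; the loan data and function names are hypothetical. A real decision tree algorithm repeats this kind of split recursively and uses a better split criterion than raw error counts.

```python
def learn_stump(samples):
    """Find the (feature, threshold) split with the fewest misclassifications.

    samples: list of (features_dict, label) with numeric features and two labels.
    Returns (feature, threshold, label_if_at_most, label_if_above).
    """
    labels = sorted({label for _, label in samples})
    best = None
    for feature in samples[0][0]:
        for threshold in sorted({features[feature] for features, _ in samples}):
            # Try both ways of assigning the two labels to the two branches.
            for low, high in (labels, labels[::-1]):
                errors = sum(
                    (low if features[feature] <= threshold else high) != label
                    for features, label in samples
                )
                if best is None or errors < best[0]:
                    best = (errors, feature, threshold, low, high)
    return best[1:]


def as_rules(stump):
    """Render a learned stump as human-readable IF...THEN rules."""
    feature, threshold, low, high = stump
    return [
        f"IF {feature} <= {threshold} THEN {low}",
        f"IF {feature} > {threshold} THEN {high}",
    ]


# Hypothetical training set: loan applicants with a known decision.
applicants = [
    ({"income": 20, "age": 25}, "reject"),
    ({"income": 35, "age": 45}, "reject"),
    ({"income": 50, "age": 30}, "approve"),
    ({"income": 80, "age": 50}, "approve"),
]

for rule in as_rules(learn_stump(applicants)):
    print(rule)
```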
Genetic Algorithms
Genetic algorithms can be viewed as a powerful technique for solving various
combinatorial or optimization problems. They are among the standard modern
instruments for data mining.
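
A minimal sketch of a genetic algorithm applied to a small combinatorial problem follows. The knapsack weights, values, and capacity are hypothetical, and the design choices (bit-string genomes, keeping the fitter half, one-point crossover, a 0.1 mutation probability, elitism) are assumptions made for illustration rather than a recommended configuration.

```python
import random


def evolve(fitness, genome_length, population_size=20, generations=40, seed=0):
    """Minimal genetic algorithm over bit strings with elitism.

    Returns the best genome found and the best fitness seen in each generation.
    """
    rng = random.Random(seed)
    population = [
        [rng.randint(0, 1) for _ in range(genome_length)]
        for _ in range(population_size)
    ]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[: population_size // 2]    # selection: keep the fitter half
        children = [population[0][:]]                    # elitism: best genome survives
        while len(children) < population_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, genome_length)        # one-point crossover
            child = mother[:cut] + father[cut:]
            if rng.random() < 0.1:                       # occasional mutation: flip one bit
                position = rng.randrange(genome_length)
                child[position] = 1 - child[position]
            children.append(child)
        population = children
    return max(population, key=fitness), history


# Hypothetical knapsack problem: choose items without exceeding the capacity.
WEIGHTS = [12, 7, 11, 8, 9]
VALUES = [24, 13, 23, 15, 16]
CAPACITY = 26


def knapsack_fitness(genome):
    """Total value of the chosen items, or 0 if they exceed the capacity."""
    weight = sum(w for w, bit in zip(WEIGHTS, genome) if bit)
    value = sum(v for v, bit in zip(VALUES, genome) if bit)
    return value if weight <= CAPACITY else 0


best, history = evolve(knapsack_fitness, genome_length=len(WEIGHTS))
print(best, knapsack_fitness(best))
```

Because the best genome of each generation is copied unchanged into the next, the best fitness recorded in `history` never decreases.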

Data warehousing

Data warehousing concepts

Operational / informational data:

Operational data is the data used to run the business. It is typically
stored, retrieved, and updated by an Online Transaction Processing (OLTP) system.
An OLTP system may be, for example, a reservations system, an accounting application,
or an order entry application. Informational data is created from the wealth of operational
data that exists in the business, together with external data that is useful for analyzing it.

OLAP / Multi-dimensional analysis:

Relational databases store data in a two-dimensional format: tables of data represented by
rows and columns. Multi-dimensional analysis solutions, commonly referred to as On-Line
Analytical Processing (OLAP) solutions, offer an extension to the relational model
to provide a multi-dimensional view of the data.
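
A multi-dimensional view can be illustrated with a small in-memory "cube". The sketch below aggregates the same fact rows along different dimensions (a roll-up) and restricts them to one value of a dimension (a slice). The dimension names, the sales figures, and the function names are hypothetical; OLAP products implement these operations on far larger data with precomputed aggregates.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, product, quarter, sales).
facts = [
    ("West", "tea", "Q1", 100),
    ("West", "tea", "Q2", 150),
    ("West", "coffee", "Q1", 80),
    ("East", "tea", "Q1", 60),
    ("East", "coffee", "Q2", 90),
]
DIMENSIONS = ("region", "product", "quarter")


def roll_up(rows, keep):
    """Aggregate sales over every dimension that is not named in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in positions)
        totals[key] += row[-1]
    return dict(totals)


def slice_cube(rows, dimension, value):
    """Fix one dimension to a single value (an OLAP 'slice')."""
    i = DIMENSIONS.index(dimension)
    return [row for row in rows if row[i] == value]


by_region = roll_up(facts, ["region"])
by_region_product = roll_up(facts, ["region", "product"])
q1_by_product = roll_up(slice_cube(facts, "quarter", "Q1"), ["product"])

print(by_region)
print(by_region_product)
print(q1_by_product)
```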
Data Marts:

Data marts are workgroup or departmental warehouses, which are small in size, typically
10-50 GB. A data mart contains informational data that is departmentalized, tailored to
the needs of a specific departmental work group. Data marts are less expensive and take
less time to implement, with a quick return on investment (ROI). They are scalable to full
data warehouses and at times are summarized subsets of more detailed, pre-existing data warehouses.

Metadata/Information Catalogue:

Metadata describes the data that is contained in the data warehouse (e.g. data elements
and their business-oriented descriptions) as well as the source of that data and the
transformations or derivations that may have been performed to create each data element.

Data Warehouse Implementation

The following components should be considered for a successful implementation of a
Data Warehousing solution:

• Open Data Warehousing architecture with common interfaces for product
integration
• Data Modeling with ability to model star-schema and multi-dimensionality
• Extraction and Transformation/propagation tools to load the data warehouse
• Data warehouse database server
• Analysis/end-user tools: OLAP/multidimensional analysis, Report and query
• Tools to manage information about the warehouse (Metadata)
• Tools to manage the Data Warehouse environment
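
The star schema mentioned in the list above places one fact table at the centre, surrounded by dimension tables that it references. The following is a minimal sketch using Python's built-in sqlite3 module; all table names, column names, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
-- The fact table holds the measures plus one foreign key per dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id   INTEGER REFERENCES dim_store(store_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    units      INTEGER,
    revenue    REAL
);
INSERT INTO dim_product VALUES (1, 'tea', 'beverages'), (2, 'soap', 'household');
INSERT INTO dim_store   VALUES (1, 'Sangli'), (2, 'Pune');
INSERT INTO dim_date    VALUES (1, 2005, 'Q4'), (2, 2006, 'Q1');
INSERT INTO fact_sales  VALUES (1, 1, 1, 10, 500.0), (1, 2, 2, 4, 200.0),
                               (2, 1, 2, 3, 90.0);
""")

# A typical star-schema query: join the facts to the dimensions of interest
# and aggregate the measure.
query = """
SELECT p.category, d.year, SUM(f.revenue)
FROM fact_sales AS f
JOIN dim_product AS p ON p.product_id = f.product_id
JOIN dim_date AS d ON d.date_id = f.date_id
GROUP BY p.category, d.year
ORDER BY p.category, d.year
"""
rows = conn.execute(query).fetchall()
for category, year, revenue in rows:
    print(category, year, revenue)
```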

Data warehouse database servers--the heart of the warehouse:

Once ready, data is loaded into a relational database management system (RDBMS)
which acts as the data warehouse. Some of the requirements of database servers for data
warehousing include: performance, capacity, scalability, open interfaces, multiple data
structures, optimizer support for star schemas, and bitmapped indexing. Some of the
popular data stores for data warehousing are relational databases like Oracle, DB2,
Informix or specialized data warehouse databases like RedBrick, SAS. To provide the
level of performance needed for a data warehouse, an RDBMS should provide
capabilities for parallel processing on Symmetric Multiprocessor (SMP) or Massively
Parallel Processor (MPP) machines, near-linear scalability, data partitioning, and system
administration.
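
Bitmapped indexing, listed among the requirements above, can be sketched as follows: for each distinct value of a column, keep one bit per row, so that combined conditions reduce to bitwise operations. The columns and function names are hypothetical, and Python integers stand in for the compressed bitmaps a database server would actually use.

```python
def build_bitmap_index(values):
    """Map each distinct value to an integer whose bit i is set when row i holds it."""
    index = {}
    for row, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row)
    return index


def rows_of(bitmap):
    """Decode a bitmap back into a list of row numbers."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical low-cardinality columns of a fact table.
region = ["West", "East", "West", "North", "East", "West"]
channel = ["web", "store", "store", "web", "web", "web"]

region_index = build_bitmap_index(region)
channel_index = build_bitmap_index(channel)

# region = 'West' AND channel = 'web' becomes a single bitwise AND.
west_and_web = region_index["West"] & channel_index["web"]
# region = 'East' OR region = 'North' becomes a bitwise OR.
east_or_north = region_index["East"] | region_index["North"]

print(rows_of(west_and_web))
print(rows_of(east_or_north))
```

This kind of index suits warehouse columns with few distinct values, where many such conditions are combined in one query.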

Characteristics of a data warehouse:


According to Bill Inmon, the characteristics that describe a data warehouse are:

Subject-oriented: Data are organized according to subject instead of application.

Integrated: When data are moved from the operational environment into the data
warehouse, they assume a consistent coding convention.

Time-variant: The data warehouse contains a place for storing data that are 5 to 10
years old, or older, to be used for comparisons, trends, and forecasting. These data are not
updated.

Non-volatile: Data are not updated or changed in any way once they enter the data
warehouse, but are only loaded and accessed.
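
Three of these characteristics can be illustrated together in a short sketch: source-specific codes are translated into one convention on the way in (integrated), each row carries the date it describes (time-variant), and rows can be loaded and read but not changed (non-volatile). The gender codes, class name, and method names are hypothetical and serve only as an example.

```python
from datetime import date

# Hypothetical source systems that encode the same attribute differently.
GENDER_CODES = {
    "m": "M", "male": "M", "1": "M",
    "f": "F", "female": "F", "0": "F",
}


def integrate_gender(raw_value):
    """Translate a source-specific gender code into the warehouse convention."""
    key = str(raw_value).strip().lower()
    if key not in GENDER_CODES:
        raise ValueError(f"unknown gender code: {raw_value!r}")
    return GENDER_CODES[key]


class AppendOnlyWarehouse:
    """Rows can be loaded and read, but never updated or deleted."""

    def __init__(self):
        self._rows = []

    def load(self, customer_id, raw_gender, snapshot):
        # Each row records the date it describes, so history is preserved.
        self._rows.append((snapshot, customer_id, integrate_gender(raw_gender)))

    def history(self, customer_id):
        """Return every stored snapshot for one customer, in load order."""
        return [row for row in self._rows if row[1] == customer_id]


warehouse = AppendOnlyWarehouse()
warehouse.load(42, "male", date(2005, 1, 1))   # code used by one source system
warehouse.load(42, 1, date(2006, 1, 1))        # code used by another source system
print(warehouse.history(42))
```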

Pros and cons of data warehousing and data mining:

Benefits:

• Understand business trends and make better forecasting decisions
• Bring better products to market in a more timely manner
• Analyze daily sales information and make quick decisions that can significantly
affect your company's performance

Data mining problems:

i. Data mining systems rely on databases to supply the raw data for input, and this raises
problems in that databases tend to be dynamic, incomplete, noisy, and large.
ii. Limited Information
iii. Noise and missing values
iv. Uncertainty
Problem with data warehousing:
One of the problems with data warehousing has been the rush of companies to jump
on the bandwagon: these companies have slapped 'data warehouse' labels on
traditional transaction-processing products, and co-opted the lexicon of the industry in
order to be considered players in this fast-growing category.
Applications

Data warehousing can be a key differentiator in many different industries. At present,
some of the most popular data warehouse applications include:

a. sales and marketing analysis across all industries
b. inventory turn and product tracking in manufacturing
c. category management, vendor analysis, and marketing program effectiveness
analysis in retail
d. profitable lane or driver risk analysis in transportation
e. profitability analysis or risk assessment in banking
f. claims analysis or fraud detection in insurance

Data mining has many and varied fields of applications such as:

1. Retail/Marketing

• Identify buying patterns from customers
• Find associations among customer demographic characteristics
• Predict response to mailing campaigns
• Market basket analysis

2. Banking

• Detect patterns of fraudulent credit card use
• Identify 'loyal' customers
• Predict customers likely to change their credit card affiliation
• Determine credit card spending by customer groups
• Find hidden correlations between different financial indicators
• Identify stock trading rules from historical market data

3. Insurance and Health Care

• Claims analysis, i.e. which medical procedures are claimed together
• Predict which customers will buy new policies
• Identify behavior patterns of risky customers
• Identify fraudulent behavior

4. Medicine

• Characterize patient behavior to predict office visits
• Identify successful medical therapies for different illnesses

5. Transportation

• Determine the distribution schedules among outlets
• Analyze loading patterns

Conclusion

Data warehousing provides the means to change raw data into information for
making effective business decisions; the emphasis is on information, not data. The data
warehouse is the hub for decision support data. A good data warehouse will provide the
RIGHT data to the RIGHT people at the RIGHT time: RIGHT NOW! A data warehouse
organizes data for business analysis, and the future of data warehousing lies in its
accessibility from the Internet. Successful implementation of a data warehouse and
data mining requires a high-performance, scalable combination of hardware and
software that can integrate easily with existing systems, so that customers can use data
warehouses to improve their decision-making and their competitive advantage.

References

Web Resources:

www.datawarehousingonline.com
www.megaputer.com
www.anderson.ucla.edu

Book:

Adriaans, P. & Zantinge, D., Data Mining
