
Data & Web Mining

Manoj Pandia, Silicon Institute of Technology

Introduction - Data

Data are any facts, numbers, or text that can be processed by a computer. Today, organizations are accumulating vast and growing amounts of data in different formats and different databases. This includes:

operational or transactional data, such as sales, cost, inventory, payroll, and accounting
nonoperational data, such as industry sales, forecast data, and macroeconomic data
metadata (data about the data itself), such as logical database design or data dictionary definitions


Introduction - Information

The patterns, associations, or relationships among all this data can provide information. For example, analysis of retail point of sale transaction data can yield information on which products are selling and when.
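The point-of-sale example above can be sketched in a few lines. This is a minimal illustration, not a real retail system: the `transactions` records and the helper names `units_by_product` and `units_by_month` are hypothetical, made up only to show raw data being summarized into information.

```python
from collections import Counter

# Hypothetical point-of-sale records: (product, month, units sold).
transactions = [
    ("milk", "Jan", 3),
    ("bread", "Jan", 2),
    ("milk", "Feb", 5),
    ("umbrella", "Feb", 1),
    ("bread", "Feb", 4),
]

def units_by_product(records):
    """Total units per product: which products are selling."""
    totals = Counter()
    for product, _month, units in records:
        totals[product] += units
    return totals

def units_by_month(records):
    """Total units per month: when products are selling."""
    totals = Counter()
    for _product, month, units in records:
        totals[month] += units
    return totals

product_totals = units_by_product(transactions)
month_totals = units_by_month(transactions)

print(product_totals.most_common())  # products ranked by units sold
print(dict(month_totals))            # units sold in each month
```

The raw rows are data; the two summaries are information in the sense used on this slide.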


Introduction - Knowledge

Information can be converted into knowledge about historical patterns and future trends. For example, summary information on retail supermarket sales can be analyzed in light of promotional efforts to provide knowledge of consumer buying behavior. Thus, a manufacturer or retailer could determine which items are most susceptible to promotional efforts.


Data, Information & Knowledge


Presence of Data

Data is found everywhere


Education, Hospital, Manufacturing Industry, Finance, Banking, Marketing, Retailing, Insurance, Transport, and so on.


Database v/s Data Warehouse


A database is a collection of related data, and a database system is a database together with its database software. Databases are transactional and may be relational, object-oriented, network, or hierarchical. Traditional databases support on-line transaction processing (OLTP), which includes insertions, updates, and deletions, while also supporting information query requirements. They are optimized to process queries that touch a small part of the database and transactions that insert or update a few tuples per relation. Thus databases must strike a balance between efficient transaction processing and support for query requirements (ad hoc user requests). That is, they cannot be further optimized for applications such as OLAP, DSS, and data mining.

A data warehouse is also a collection of information together with a supporting system. But a data warehouse is typically optimized for access according to a decision maker's needs. Data warehouses are designed specifically to support efficient extraction, processing, and presentation for analytic and decision-making purposes. In contrast to databases, data warehouses generally contain very large amounts of data from multiple sources, which may include databases built on different data models and sometimes files acquired from independent systems and platforms.
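The contrast between the two workloads can be sketched with Python's built-in `sqlite3` module. This is only an illustration under assumptions: the `sales` table, its columns, and its rows are hypothetical, and a single in-memory SQLite database stands in for both roles, which a real warehouse would not do.

```python
import sqlite3

# Hypothetical in-memory table standing in for an operational database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "id INTEGER PRIMARY KEY, region TEXT, product TEXT, units INTEGER)"
)
conn.executemany(
    "INSERT INTO sales (region, product, units) VALUES (?, ?, ?)",
    [
        ("East", "milk", 3),
        ("East", "bread", 2),
        ("West", "milk", 5),
        ("West", "bread", 4),
    ],
)

# OLTP-style work: a short transaction that updates a single tuple.
with conn:
    conn.execute("UPDATE sales SET units = units + 1 WHERE id = ?", (1,))

# Analytic-style work: a query that scans the whole table and aggregates it.
region_totals = dict(
    conn.execute(
        "SELECT region, SUM(units) FROM sales GROUP BY region"
    ).fetchall()
)
print(region_totals)
```

The update touches one row by key, while the aggregate reads every row. A warehouse is organized to make the second kind of query efficient over far larger volumes.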


Data Mart

A data mart is an easy-to-access repository of a subset of highly focused data for a single function or department (e.g., finance, sales, marketing) and is considerably smaller than a data warehouse. The data comes from operational information that a particular group of employees needs for analysis, content, and presentations, all in terms that are familiar to them. Data for a data mart is derived from a data warehouse or from more specialized access.


The Evolution
Evolutionary Step: Data Collection (1960s)
  Business Question: "What was my total revenue in the last five years?"
  Enabling Technologies: Computers, tapes, disks
  Product Providers: IBM, CDC
  Characteristics: Retrospective, static data delivery

Evolutionary Step: Data Access (1980s)
  Business Question: "What were unit sales in New England last March?"
  Enabling Technologies: Relational databases (RDBMS), Structured Query Language (SQL), ODBC
  Product Providers: Oracle, Sybase, Informix, IBM, Microsoft
  Characteristics: Retrospective, dynamic data delivery at record level

Evolutionary Step: Data Warehousing & Decision Support (1990s)
  Business Question: "What were unit sales in New England last March? Drill down to Boston."
  Enabling Technologies: On-line analytic processing (OLAP), multidimensional databases, data warehouses
  Product Providers: Pilot, Comshare, Arbor, Cognos, Microstrategy
  Characteristics: Retrospective, dynamic data delivery at multiple levels

Evolutionary Step: Data Mining (Emerging Today)
  Business Question: "What's likely to happen to Boston unit sales next month? Why?"
  Enabling Technologies: Advanced algorithms, multiprocessor computers, massive databases
  Product Providers: Pilot, Lockheed, IBM, SGI, numerous startups (nascent industry)
  Characteristics: Prospective, proactive information delivery


What is Data Mining

Data mining refers to extracting or mining knowledge from large amounts of data. Mining of gold from rocks or sand is referred to as gold mining rather than rock or sand mining. Thus, data mining should have been more appropriately named knowledge mining from data, which is unfortunately somewhat long. Knowledge mining, a shorter term, may not reflect the emphasis on mining from large amounts of data.


Why Data Mining?

Data explosion problem
  Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses, and other information repositories

We are drowning in data, but starving for knowledge!

Solution: Data warehousing and data mining
  Data warehousing and on-line analytical processing
  Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases


Why Data Mining?


How can I analyze my data?

We are data rich, but information poor



Data Mining & KDD

Many people treat data mining as a synonym for Knowledge Discovery from Data, or KDD.
Others view data mining as simply an essential step in the process of knowledge discovery.


Data Mining & KDD


[Figure: the KDD process. Databases -> Cleaning & Integration -> Data Warehouse -> Selection & Transformation -> Task Relevant Data -> Data Mining -> Patterns -> Evaluation & Presentation -> Knowledge]


Data Mining & KDD


1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations, for instance)
5. Data mining (an essential process where intelligent methods are applied in order to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on some interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user)
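The seven steps above can be sketched as a chain of small functions. This is a toy illustration under assumptions, not the method of any particular tool: the two data sources, the function names, the use of item-pair counting as the "intelligent method", and support as the interestingness measure are all choices made here for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw data from two sources: (transaction id, item).
# None marks a noisy record.
store_a = [("t1", "milk"), ("t1", "bread"), ("t2", "milk"), ("t2", None)]
store_b = [("t3", "milk"), ("t3", "bread"), ("t4", "eggs"), ("t4", "bread")]

def clean(rows):
    """Step 1: drop noisy records (here, rows with a missing item)."""
    return [(tid, item) for tid, item in rows if item is not None]

def integrate(*sources):
    """Step 2: combine several data sources into one list."""
    merged = []
    for source in sources:
        merged.extend(source)
    return merged

def select(rows, relevant_items):
    """Step 3: keep only the data relevant to the analysis task."""
    return [(tid, item) for tid, item in rows if item in relevant_items]

def transform(rows):
    """Step 4: consolidate rows into one item set (basket) per transaction."""
    baskets = {}
    for tid, item in rows:
        baskets.setdefault(tid, set()).add(item)
    return baskets

def mine(baskets):
    """Step 5: count how often each pair of items occurs together."""
    pair_counts = Counter()
    for items in baskets.values():
        for pair in combinations(sorted(items), 2):
            pair_counts[pair] += 1
    return pair_counts

def evaluate(pair_counts, n_baskets, min_support):
    """Step 6: keep pairs whose support meets the threshold."""
    return {
        pair: count / n_baskets
        for pair, count in pair_counts.items()
        if count / n_baskets >= min_support
    }

def present(patterns):
    """Step 7: render the surviving patterns for the user."""
    return [f"{a} + {b}: support {s:.2f}" for (a, b), s in sorted(patterns.items())]

rows = integrate(clean(store_a), clean(store_b))
rows = select(rows, relevant_items={"milk", "bread"})
baskets = transform(rows)
patterns = evaluate(mine(baskets), n_baskets=len(baskets), min_support=0.5)
report = present(patterns)
print(report)
```

Each function maps to one numbered step, so the order of the calls at the bottom mirrors the order of the list.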

Architecture

Data Mining: on What Kind of Data


In principle, data mining should be applicable to any kind of data repository. Thus the scope of our examination of data repositories will include:

relational databases
data warehouses
transactional databases
advanced database systems
  object-relational databases
  specific application-oriented databases
    spatial databases
    time-series databases
    text databases
    multimedia databases
flat files
data streams
World Wide Web

Data Mining: What Kinds of Patterns Can Be Mined?


Data mining tasks can be classified into two categories:

Descriptive: characterize the general properties of the data in the database
Predictive: perform inference on the current data in order to make predictions
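The two categories above can be contrasted on one small series. This is a minimal sketch under assumptions: the `monthly_units` figures are invented, and an ordinary least-squares line is used as the predictive model only because it is simple, not because the slides prescribe it.

```python
from statistics import mean

# Hypothetical unit sales for months 1 to 4.
months = [1, 2, 3, 4]
monthly_units = [10, 12, 14, 16]

# Descriptive task: characterize general properties of the existing data.
summary = {
    "mean": mean(monthly_units),
    "min": min(monthly_units),
    "max": max(monthly_units),
}

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mean_x, mean_y = mean(xs), mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Predictive task: infer from the current data to predict an unseen month.
slope, intercept = fit_line(months, monthly_units)
predicted_month_5 = slope * 5 + intercept

print(summary)
print(predicted_month_5)
```

The summary only describes what is already in the data, while the fitted line is used to say something about a month that has not been observed.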

Data Mining: What Kinds of Patterns Can Be Mined?

Concept/Class Description: Characterization and Discrimination
Mining Frequent Patterns, Associations, and Correlations
Classification and Prediction
Cluster Analysis
Outlier Analysis
Evolution Analysis
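As one concrete instance from the list above, outlier analysis can be sketched with a simple z-score rule. This is an assumption-laden illustration: the `daily_units` values are invented, the helper name `z_score_outliers` is hypothetical, and the threshold of 2 standard deviations is an arbitrary choice for the example rather than a recommended setting.

```python
from statistics import mean, pstdev

def z_score_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values are identical, so nothing can be flagged.
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical daily unit sales with one unusual day.
daily_units = [10, 11, 9, 10, 12, 50]
outliers = z_score_outliers(daily_units)
print(outliers)
```

A z-score rule is only one of many ways to flag outliers, and it is sensitive to the very values it tries to detect, since they inflate the mean and standard deviation.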

Major Issues

Mining methodology and user interaction issues
  Mining different kinds of knowledge in databases
  Interactive mining of knowledge at multiple levels of abstraction
  Incorporation of background knowledge
  Data mining query languages and ad hoc data mining
  Presentation and visualization of data mining results
  Handling noisy or incomplete data
  Pattern evaluation: the interestingness problem

Performance issues
  Efficiency and scalability of data mining algorithms
  Parallel, distributed, and incremental mining algorithms

Issues relating to the diversity of database types
  Handling of relational and complex types of data
  Mining information from heterogeneous databases and global information systems

Classification of Data Mining Systems
