Introduction - Data
Data are any facts, numbers, or text that can be processed by a computer. Today, organizations are accumulating vast and growing amounts of data in different formats and different databases. This includes:
Operational or transactional data, such as sales, cost, inventory, payroll, and accounting
Nonoperational data, such as industry sales, forecast data, and macroeconomic data
Metadata: data about the data itself, such as logical database design or data dictionary definitions
Manoj Pandia
Introduction - Information
The patterns, associations, or relationships among all this data can provide information. For example, analysis of retail point of sale transaction data can yield information on which products are selling and when.
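As a minimal sketch of the retail example above, the snippet below totals units sold per product and month. The records, field layout, and the function name `units_by_product_and_month` are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical point-of-sale records: (product, month, units sold).
transactions = [
    ("milk", "2024-01", 3),
    ("milk", "2024-01", 4),
    ("bread", "2024-01", 2),
    ("milk", "2024-02", 5),
    ("ice cream", "2024-07", 8),
]

def units_by_product_and_month(rows):
    """Sum units sold for each (product, month) pair."""
    totals = Counter()
    for product, month, units in rows:
        totals[(product, month)] += units
    return totals

totals = units_by_product_and_month(transactions)

# Print which products sold, and when.
for (product, month), units in sorted(totals.items()):
    print(f"{product:10s} {month}  {units}")
```

The raw rows are data; the per-product, per-month totals are the kind of information the slide refers to.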
Introduction - Knowledge
Information can be converted into knowledge about historical patterns and future trends. For example, summary information on retail supermarket sales can be analyzed in light of promotional efforts to provide knowledge of consumer buying behavior. Thus, a manufacturer or retailer could determine which items are most susceptible to promotional efforts.
Presence of Data
Education, Hospital, Manufacturing Industry, Finance, Banking, Marketing, Retailing, Insurance, Transport, and so on.
A database is a collection of related data, and a database system is a database together with its database software. Transactional databases may be relational, object-oriented, network, or hierarchical. Traditional databases support on-line transaction processing (OLTP), which includes insertions, updates, and deletions, while also supporting information query requirements. They are optimized for queries that touch a small part of the database and for transactions that insert or update a few tuples per relation. Thus databases must strike a balance between efficient transaction processing and support for query requirements (ad hoc user requests). That is, they cannot be further optimized for applications such as OLAP, DSS, and data mining.
A data warehouse is also a collection of information as well as a supporting system, but it is typically optimized for access driven by a decision maker's needs. Data warehouses are designed specifically to support efficient extraction, processing, and presentation for analytic and decision-making purposes. In contrast to databases, data warehouses generally contain very large amounts of data from multiple sources, which may include databases built on different data models and sometimes files acquired from independent systems and platforms.
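The contrast between the two workloads can be sketched with Python's built-in `sqlite3` module. This is only an illustration under assumed names: the `sales` table, its columns, and the rows are hypothetical, and SQLite stands in for both kinds of system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, product TEXT, units INTEGER)"
)

# OLTP-style work: short transactions that insert or update a few rows.
with conn:
    conn.executemany(
        "INSERT INTO sales (region, product, units) VALUES (?, ?, ?)",
        [
            ("New England", "widget", 10),
            ("New England", "gadget", 4),
            ("Midwest", "widget", 7),
        ],
    )
with conn:
    conn.execute("UPDATE sales SET units = units + 1 WHERE id = ?", (2,))

# Analytic-style work: a single query that reads every row to build a summary.
units_by_region = dict(
    conn.execute("SELECT region, SUM(units) FROM sales GROUP BY region")
)
print(units_by_region)
```

The inserts and the update each touch a handful of tuples, while the summary query scans the whole table, which is the access pattern a data warehouse is designed around.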
Data Mart
A data mart is an easy-to-access repository of a subset of highly focused data for a single function or department (e.g., finance, sales, marketing) and is considerably smaller than a data warehouse. The data comes from operational information that a particular group of employees needs for analysis, content, and presentations, all in terms that are familiar to them. Data for a data mart is derived from a data warehouse or from more specialized access.
The Evolution
Evolutionary Step: Data Collection (1960s)
Business Question: "What was my total revenue in the last five years?"
Characteristics: Retrospective, static data delivery

Evolutionary Step: Data Access (1980s)
Business Question: "What were unit sales in New England last March?"
Enabling Technologies: Relational databases (RDBMS), Structured Query Language (SQL), ODBC
Product Providers: Oracle, Sybase, Informix, IBM, Microsoft
Characteristics: Retrospective, dynamic data delivery at record level

Evolutionary Step: Data Warehousing & Decision Support (1990s)
Business Question: "What were unit sales in New England last March? Drill down to Boston."
Enabling Technologies: On-line analytic processing (OLAP), multidimensional databases, data warehouses
Product Providers: Pilot, Comshare, Arbor, Cognos, Microstrategy
Characteristics: Retrospective, dynamic data delivery at multiple levels

Evolutionary Step: Data Mining (Emerging Today)
Business Question: "What's likely to happen to Boston unit sales next month? Why?"
Enabling Technologies: Advanced algorithms, multiprocessor computers, massive databases
Product Providers: Pilot, Lockheed, IBM, SGI, numerous startups (nascent industry)
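The two retrospective business questions quoted above (unit sales in New England last March, then a drill-down to Boston) could be expressed roughly as follows. This is a sketch only: the `unit_sales` table, its columns, the year, and the figures are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE unit_sales (region TEXT, city TEXT, month TEXT, units INTEGER)"
)
conn.executemany(
    "INSERT INTO unit_sales VALUES (?, ?, ?, ?)",
    [
        ("New England", "Boston", "1997-03", 120),
        ("New England", "Hartford", "1997-03", 80),
        ("New England", "Boston", "1997-04", 95),
        ("Midwest", "Chicago", "1997-03", 200),
    ],
)

# Data-access question: unit sales in New England last March.
region_total = conn.execute(
    "SELECT SUM(units) FROM unit_sales WHERE region = ? AND month = ?",
    ("New England", "1997-03"),
).fetchone()[0]

# Decision-support follow-up: drill down from the region to its cities.
units_by_city = dict(
    conn.execute(
        "SELECT city, SUM(units) FROM unit_sales "
        "WHERE region = ? AND month = ? GROUP BY city",
        ("New England", "1997-03"),
    )
)
print(region_total, units_by_city)
```

Both queries only report what already happened. The data mining question ("What's likely to happen next month? Why?") cannot be answered by a query of this kind alone.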
Data mining refers to extracting or mining knowledge from large amounts of data. Mining of gold from rocks or sand is referred to as gold mining rather than rock or sand mining. Thus, data mining should have been more appropriately named knowledge mining from data, which is unfortunately somewhat long. Knowledge mining, a shorter term, may not reflect the emphasis on mining from large amounts of data.
Tremendous amounts of data are now stored in databases, data warehouses, and other information repositories.
We are drowning in data, but starving for knowledge! Solution: Data warehousing and data mining
Many people treat data mining as a synonym for Knowledge Discovery from Data, or KDD.
Others view data mining as simply an essential step in the process of knowledge discovery.
[Figure: the knowledge discovery process, with stages labeled Selection & Transformation, Data Mining, and Patterns]
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations, for instance)
5. Data mining (an essential process where intelligent methods are applied in order to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on some interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user)
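The seven steps above can be sketched as a toy pipeline, one function per step. Everything here is an assumption made for illustration: the basket records, the function names, the choice of pair co-occurrence counting as the "intelligent method", and support as the interestingness measure.

```python
from collections import Counter
from itertools import combinations

# Two hypothetical sources of basket records: (store, basket_id, item).
source_a = [("S1", 1, "bread"), ("S1", 1, "butter"), ("S1", 2, "bread"), ("S1", 2, None)]
source_b = [("S2", 3, "Bread "), ("S2", 3, "butter"), ("S2", 4, "milk"), ("S2", 4, "bread")]

def clean(rows):
    """Data cleaning: drop records with a missing item, normalize item names."""
    return [(s, b, i.strip().lower()) for s, b, i in rows if i]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [row for src in sources for row in src]

def select(rows, stores):
    """Data selection: keep only records relevant to the analysis."""
    return [r for r in rows if r[0] in stores]

def transform(rows):
    """Data transformation: consolidate records into one item set per basket."""
    baskets = {}
    for store, basket_id, item in rows:
        baskets.setdefault((store, basket_id), set()).add(item)
    return list(baskets.values())

def mine(baskets):
    """Data mining: count how often each pair of items occurs together."""
    pairs = Counter()
    for basket in baskets:
        pairs.update(combinations(sorted(basket), 2))
    return pairs

def evaluate(pairs, n_baskets, min_support):
    """Pattern evaluation: keep pairs whose support meets the threshold."""
    return {p: c / n_baskets for p, c in pairs.items() if c / n_baskets >= min_support}

def present(patterns):
    """Knowledge presentation: render the patterns as readable lines."""
    return [f"{a} + {b}: support {s:.2f}" for (a, b), s in sorted(patterns.items())]

rows = select(integrate(clean(source_a), clean(source_b)), stores={"S1", "S2"})
baskets = transform(rows)
patterns = evaluate(mine(baskets), len(baskets), min_support=0.5)
report = present(patterns)
print("\n".join(report))
```

In this sketch each source is cleaned before integration, which is one reasonable ordering. Real pipelines often iterate between the steps rather than running them once in sequence.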
Architecture
In principle, data mining should be applicable to any kind of data repository. Thus the scope of our examination of data repositories will include
Data mining tasks can be classified into two categories:
Descriptive: characterize the general properties of the data in the database
Predictive: perform inference on the current data in order to make predictions
Concept/Class Description: Characterization and Discrimination
Mining Frequent Patterns, Associations, and Correlations
Classification and Prediction
Cluster Analysis
Outlier Analysis
Evolution Analysis
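To make the difference between describing data and predicting from it concrete, the sketch below summarizes a short series of monthly unit sales and then extrapolates it with a least-squares line. The figures and function names are hypothetical, and a straight-line fit is only one simple stand-in for a prediction method.

```python
# Hypothetical monthly unit sales for months 1 to 4.
months = [1, 2, 3, 4]
units = [100, 110, 120, 130]

def describe(values):
    """Descriptive task: summarize general properties of the data."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
    }

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

summary = describe(units)

# Predictive task: infer a trend from current data and forecast month 5.
slope, intercept = fit_line(months, units)
forecast = slope * 5 + intercept
print(summary, forecast)
```

The summary says nothing about the future, while the fitted line produces a forecast whose quality depends entirely on whether the linear assumption holds.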
Major Issues
Mining different kinds of knowledge in databases
Interactive mining of knowledge at multiple levels of abstraction
Incorporation of background knowledge
Data mining query languages and ad hoc data mining
Presentation and visualization of data mining results
Handling noisy or incomplete data
Pattern evaluation: the interestingness problem
Efficiency and scalability of data mining algorithms
Parallel, distributed, and incremental mining algorithms
Handling of relational and complex types of data
Mining information from heterogeneous databases and global information systems
Performance issues