
Data Mining

Mr. Umesh B. Pawar (1), Mr. Purushottam R. Patil (2)

(1) ME CSE Part IV; Team Member, Jala-Sri A Water Shade Surveillance Research Institute, MJ College, Jalgaon
(2) ME CSE Part IV, GECA Aurangabad

Preface

Our capabilities for both generating and collecting data have been increasing rapidly. Contributing factors include the computerization of business, scientific, and government transactions; the widespread use of digital cameras, publication tools, and bar codes for most commercial products; and advances in data collection tools ranging from scanned text and image platforms to satellite remote sensing systems. In addition, popular use of the World Wide Web as a global information system has flooded us with a tremendous amount of data and information. This explosive growth in stored or transient data has generated an urgent need for new techniques and automated tools that can intelligently assist us in transforming the vast amounts of data into useful information and knowledge.

Introduction

Data mining is a multidisciplinary field, drawing work from areas including database technology, machine learning, statistics, pattern recognition, information retrieval, neural networks, knowledge-based systems, artificial intelligence, high-performance computing, and data visualization. We present techniques for the discovery of patterns hidden in large data sets, focusing on issues relating to their feasibility, usefulness, effectiveness, and scalability. Data mining emerged during the late 1980s, made great strides during the 1990s, and continues to flourish into the new millennium.

What Motivated Data Mining?

Necessity is the mother of invention. Data mining has attracted a great deal of attention in the information industry and in society as a whole in recent years, due to the wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge. The information and knowledge gained can be used for applications ranging from market analysis, fraud detection, and customer retention, to production control and science exploration.

Data mining can be viewed as a result of the natural evolution of information technology. The database system industry has witnessed an evolutionary path in the development of the following functionalities: data collection and database creation, data management, and advanced data analysis. For instance, the early development of data collection and database creation mechanisms served as a prerequisite for the later development of effective mechanisms for data storage and retrieval, and for query and transaction processing. With numerous database systems offering query and transaction processing as common practice, advanced data analysis has naturally become the next target.

Data can now be stored in many different kinds of databases and information repositories. One data repository architecture that has emerged is the data warehouse, a repository of multiple heterogeneous data sources organized under a unified schema at a single site in order to facilitate management decision making. Data warehouse technology includes data cleaning, data integration, and online analytical processing (OLAP), that is, analysis techniques with functionalities such as summarization, consolidation, and aggregation, as well as the ability to view information from different angles. Although OLAP tools support multidimensional analysis and decision making, additional data analysis tools are required for in-depth analysis, such as data classification, clustering, and the characterization of data changes over time.

In addition, huge volumes of data can be accumulated beyond databases and data warehouses. Typical examples include the World Wide Web and data streams, where data flow in and out like streams, as in applications such as video surveillance, telecommunication, and sensor networks. The effective and efficient analysis of data in such different forms becomes a challenging task.

Consequently, important decisions are often made based not on the information-rich data stored in data repositories, but rather on a decision maker's intuition, simply because the decision maker does not have the tools to extract the valuable knowledge embedded in the vast amounts of data.
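The OLAP-style summarization and aggregation mentioned above can be illustrated with a small roll-up. The following is only a sketch under stated assumptions: the sales facts, the dimension names, and the roll_up helper are hypothetical and are not taken from any particular OLAP product.

```python
from collections import defaultdict

# Hypothetical sales facts: (region, product, quarter, amount).
SALES = [
    ("North", "PC", "Q1", 1200),
    ("North", "Printer", "Q1", 300),
    ("South", "PC", "Q1", 900),
    ("North", "PC", "Q2", 1500),
    ("South", "Printer", "Q2", 400),
]

# Assumed dimension order, matching the first three fields of each fact.
DIMENSIONS = ("region", "product", "quarter")


def roll_up(facts, keep):
    """Sum the amount over every dimension that is not listed in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[i] for i in positions)
        totals[key] += row[3]
    return dict(totals)


# The same facts viewed from two different angles.
by_region = roll_up(SALES, ["region"])
by_product_quarter = roll_up(SALES, ["product", "quarter"])
print(by_region)
print(by_product_quarter)
```

Each call keeps a different subset of dimensions, which is one simple way to picture viewing the same information from different angles.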

Data Collection and Database Creation
- Primitive file processing

Database Management Systems
- Hierarchical and network database systems
- Relational database systems
- Data modeling tools
- Indexing and accessing methods
- Query languages: SQL, etc.
- User interfaces, forms and reports
- Query processing and query optimization
- Transactions, concurrency control and recovery
- On-line transaction processing (OLTP)

Advanced Database Systems
- Advanced data models: object-relational, etc.
- Advanced applications: spatial, temporal, multimedia, active

Advanced Data Analysis: Data Warehousing and Mining
- Data warehousing and OLAP
- Data mining and knowledge discovery: generalization, classification, association and structured pattern analysis, etc.
- Advanced data mining applications: stream data mining, bio-data mining, text mining, web mining, etc.
- Data mining and society: privacy-preserving data mining

Web-based Databases
- XML-based database systems
- Integration with information retrieval
- Data and information integration

New Generation of Integrated Data and Information Systems

Figure: The evolution of database system technology.

What is data mining?

Data mining is the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories.

The architecture of a typical data mining system may have the following major components:

- Database, data warehouse, World Wide Web, or other information repository: This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data:
  1. Data cleaning (to remove noise and inconsistent data)
  2. Data integration (where multiple data sources may be combined)
- Database or data warehouse server: The database or data warehouse server is responsible for fetching the relevant data, based on the user's data mining request.
- Knowledge base: This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns.
- Data mining engine: This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association and correlation analysis, and evolution analysis.
- Pattern evaluation module: This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns. It may use interestingness thresholds to filter out discovered patterns. Alternatively, the pattern evaluation module may be integrated with the mining module, depending on the implementation of the data mining method used.
- User interface: This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results. In addition, this component allows the user to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms.

Figure: Architecture of a data mining system. Components shown: user interface; pattern evaluation; knowledge base; data mining engine; database or data warehouse server; data cleaning, integration and selection; database; data warehouse; World Wide Web; other information repositories.

Data mining functionalities

We have observed various types of databases and information repositories on which data mining can be performed. Let us now examine the kinds of data patterns that can be mined. Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks. In general, data mining tasks can be classified into two categories: descriptive and predictive. Descriptive mining tasks characterize the general properties of the data in the database. Predictive mining tasks perform inference on the current data in order to make predictions.
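The data cleaning and data integration steps named in the architecture can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the record layout, the rule that a missing or negative amount counts as noise, and the "first record per id wins" merge policy are all assumptions made for the example.

```python
def clean(records):
    """Drop records whose amount is missing or negative (treated here as noise)."""
    cleaned = []
    for rec in records:
        amount = rec.get("amount")
        if amount is None or amount < 0:
            continue
        cleaned.append(rec)
    return cleaned


def integrate(*sources):
    """Combine several sources, keeping the first record seen for each id."""
    merged = {}
    for source in sources:
        for rec in source:
            merged.setdefault(rec["id"], rec)
    return list(merged.values())


# Two hypothetical sources describing the same customers.
store_db = [{"id": 1, "amount": 250}, {"id": 2, "amount": None}]
web_log = [
    {"id": 2, "amount": 80},
    {"id": 3, "amount": -5},
    {"id": 1, "amount": 250},
]

result = integrate(clean(store_db), clean(web_log))
print(result)
```

Cleaning runs before integration here so that an invalid record from one source cannot shadow a valid record for the same id from another source.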

Concept/class description

Data can be associated with classes or concepts. For example, in the All Electronics store, classes of items for sale include computers and printers, and concepts of customers include big spenders and budget spenders. It can be useful to describe individual classes and concepts in summarized, concise, and yet precise terms. Such descriptions of a class or a concept are called class/concept descriptions. These descriptions can be derived via 1) data characterization, by summarizing the data of the class under study in general terms; 2) data discrimination, by comparison of the target class with one or a set of comparative classes; or 3) both data characterization and discrimination.

Data characterization is a summarization of the general characteristics or features of a target class of data. The data corresponding to the user-specified class are typically collected by a database query.

Data discrimination is a comparison of the general features of target class data objects with the general features of objects from one or a set of contrasting classes. The target and contrasting classes can be specified by the user, and the corresponding data objects retrieved through database queries. For example, the user may like to compare the general features of software products whose sales increased by 10% in the last year with those whose sales decreased by at least 30% during the same period. The methods used for data discrimination are similar to those used for data characterization.

Mining frequent patterns, associations, and correlations

Frequent patterns, as the name suggests, are patterns that occur frequently in data. There are many kinds of frequent patterns, including itemsets, subsequences, and substructures. A frequent itemset typically refers to a set of items that frequently appear together in a transactional data set, such as milk and bread. A frequently occurring subsequence, such as the pattern that customers tend to purchase first a PC, followed by a digital camera, and then a memory card, is a sequential pattern. Substructures can refer to different structural forms, such as graphs, trees, or lattices, which may be combined with itemsets or subsequences. If a substructure occurs frequently, it is called a structured pattern. Mining frequent patterns leads to the discovery of interesting associations and correlations within data.

Classification and prediction

Classification is the process of finding a model that describes and distinguishes data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown. How is the derived model presented?
The derived model may be represented in various forms, such as classification rules, decision trees, mathematical formulae, or neural networks. A decision tree is a flowchart-like tree structure, where each node denotes a test on an attribute value, each branch represents an outcome of the test, and tree leaves represent classes or class distributions. Decision trees can easily be converted to classification rules. A neural network, when used for classification, is typically a collection of neuron-like processing units with weighted connections between the units. There are many other methods for constructing classification models, such as naive Bayesian classification, support vector machines, and k-nearest neighbor classification.

Cluster analysis

What is cluster analysis? Unlike classification and prediction, which analyze class-labeled data objects, clustering analyzes data objects without consulting a known class label. In general, the class labels are not present in the training data simply because they are not known to begin with. Clustering can be used to generate such labels. The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity. That is, clusters of objects are formed so that objects within a cluster have high similarity in comparison to one another, but are very dissimilar to objects in other clusters.

Outlier analysis

A database may contain data objects that do not comply with the general behavior or model of the data. These data objects are outliers. Most data mining methods discard outliers as noise or exceptions. However, in some applications such as fraud detection, the rare events can be more interesting than the more regularly occurring ones. The analysis of outlier data is referred to as outlier mining.

Evolution analysis

Data evolution analysis describes and models regularities or trends for objects whose behavior changes over time.

Classification of data mining systems

Data mining is an interdisciplinary field, the confluence of a set of disciplines, including database systems, statistics, machine learning, visualization, and information science. Moreover, depending on the data mining approach used, techniques from other disciplines may be applied, such as neural networks, fuzzy and rough set theory, knowledge representation, inductive logic programming, or high-performance computing. Depending on the kinds of data to be mined or on the given data mining application, the data mining system may also integrate techniques from spatial data analysis, information retrieval, pattern recognition, image analysis, computer graphics, web technology, or psychology.
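Returning to the classification functionality described earlier, the remark that a decision tree converts readily into classification rules can be made concrete. The sketch below is illustrative only: the tree, its attribute names, and its class labels are hypothetical, and the nested-tuple encoding is just one convenient representation.

```python
# Hypothetical tree. An internal node is (attribute, {outcome: subtree});
# a leaf is a plain string holding the class label.
TREE = ("age", {
    "youth": ("student", {"yes": "buys", "no": "does_not_buy"}),
    "middle_aged": "buys",
    "senior": ("credit", {"fair": "does_not_buy", "excellent": "buys"}),
})


def tree_to_rules(node, conditions=()):
    """Return one (conditions, label) rule per root-to-leaf path."""
    if isinstance(node, str):
        return [(list(conditions), node)]
    attribute, branches = node
    rules = []
    for outcome, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, conditions + ((attribute, outcome),)))
    return rules


def classify(node, record):
    """Follow the attribute tests from the root until a leaf label is reached."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[record[attribute]]
    return node


rules = tree_to_rules(TREE)
for conditions, label in rules:
    tests = " AND ".join(f"{attr} = {value}" for attr, value in conditions)
    print(f"IF {tests} THEN class = {label}")
```

Each root-to-leaf path yields exactly one rule, so the rule set and the tree assign the same class to any record that the tree can classify.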

Figure: Data mining as a confluence of multiple disciplines (database technology, statistics, information science, machine learning, visualization, and other disciplines).

Data mining systems can be categorized according to various criteria, as follows:

- Classification according to the kinds of databases mined: A data mining system can be classified according to the kinds of databases mined. Database systems can be classified according to different criteria, each of which may require its own data mining technique. Data mining systems can therefore be classified accordingly. For instance, if classifying according to data models, we may have a relational, transactional, object-relational, or data warehouse mining system.
- Classification according to the kinds of knowledge mined: Data mining systems can be categorized according to the kinds of knowledge they mine, that is, based on data mining functionalities such as characterization, discrimination, association, and evolution analysis. A comprehensive data mining system usually provides multiple and integrated data mining functionalities.
- Classification according to the kinds of techniques utilized: Data mining systems can be categorized according to the underlying data mining techniques employed. These techniques can be described according to the degree of user interaction involved or the methods of data analysis employed (e.g., database-oriented or data warehouse-oriented techniques, machine learning, statistics, visualization, pattern recognition, and so on).
- Classification according to the applications adapted: Data mining systems can also be categorized according to the applications they adapt. For example, data mining systems may be tailored specifically for finance, telecommunications, DNA, stock markets, e-mail, and so on.
