
Major issues in Data Mining

Major issues in data mining research can be partitioned into five groups: mining methodology, user interaction, performance (efficiency and scalability), diversity of data types, and data mining and society.

Mining methodology and user interaction issues
These reflect the kinds of knowledge mined, the ability to mine knowledge at multiple granularities, the use of domain knowledge, ad hoc mining, and knowledge visualization. They include: mining different kinds of knowledge in databases; interactive mining of knowledge at multiple levels of abstraction; incorporation of background knowledge; data mining query languages and ad hoc data mining; presentation and visualization of data mining results; handling noisy or incomplete data; and pattern evaluation (the interestingness problem).

Performance (efficiency and scalability) issues
These concern the efficiency, scalability, and parallelization of data mining algorithms: the efficiency and scalability of data mining algorithms, and parallel, distributed, and incremental mining algorithms.

Diversity of database types
These issues concern the handling of relational and complex types of data, and the mining of information from heterogeneous databases and global information systems.

Data mining and society
How does data mining impact society? What steps can data mining take to preserve the privacy of individuals? Do we use data mining in our daily lives without knowing that we do? These questions raise the following issues: the social impacts of data mining, privacy-preserving data mining, and invisible data mining.

Data preprocessing
The data you wish to analyze by data mining techniques are likely to be incomplete (lacking attribute values or certain attributes of interest, or containing only aggregate data), noisy (containing errors), and inconsistent. How can the data be preprocessed in order to help improve the quality of the data and, consequently, of the mining results? How can the data be preprocessed so as to improve the efficiency and ease of the mining process? There are a number of data preprocessing techniques. Data cleaning can be applied to remove noise and correct inconsistencies in the data. Data integration merges data from multiple sources into a coherent data store, such as a data warehouse. Data transformations, such as normalization, may be applied; for example, normalization may improve the accuracy and efficiency of mining algorithms involving distance measurements. Data reduction can reduce the data size by aggregating, eliminating redundant features, or clustering, for instance.
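To make these steps concrete, here is a minimal sketch of a preprocessing flow in Python with pandas. The table, column names, and values are invented for illustration and are not taken from the text.

```python
import pandas as pd

# Hypothetical raw table: incomplete (a missing age), noisy (an extreme
# income), and inconsistent ("NY" vs "ny").
df = pd.DataFrame({
    "age":    [23, 35, None, 41, 29],
    "income": [48000, 52000, 51000, 990000, 50000],
    "city":   ["NY", "ny", "LA", "LA", "NY"],
})

# Data cleaning: fill the missing age with the attribute mean, unify coding.
df["age"] = df["age"].fillna(df["age"].mean())
df["city"] = df["city"].str.upper()

# Data transformation: min-max normalization of income into [0.0, 1.0].
df["income_norm"] = (df["income"] - df["income"].min()) / (
    df["income"].max() - df["income"].min()
)

# Data reduction: aggregate to one summary row per city.
print(df.groupby("city")[["age", "income"]].mean())
```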

Data cleaning
Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.

Missing values
How can you go about filling in the missing values for an attribute? Let's look at the following methods: ignore the tuple; fill in the missing value manually; use a global constant to fill in the missing value; use the attribute mean to fill in the missing value; use the attribute mean for all samples belonging to the same class as the given tuple; or use the most probable value to fill in the missing value.
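As a rough illustration, the following sketch applies several of these strategies with pandas; the attribute names ("income", "class") and the values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "income": [48000, None, 51000, None, 50000, 72000],
    "class":  ["low", "low", "low", "high", "low", "high"],
})

# 1. Ignore the tuple (drop rows that have a missing value).
dropped = df.dropna(subset=["income"])

# 2. Use a global constant to fill in the missing value.
by_constant = df["income"].fillna(-1)

# 3. Use the attribute mean.
by_mean = df["income"].fillna(df["income"].mean())

# 4. Use the attribute mean of samples in the same class as the given tuple.
by_class_mean = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean()))

print(by_class_mean)
```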

Noisy data
Noise is a random error or variance in a measured variable. Smoothing techniques include the following.

Binning: Binning methods smooth a sorted data value by consulting its neighborhood, that is, the values around it. The sorted values are distributed into a number of buckets, or bins. Because binning methods consult the neighborhood of values, they perform local smoothing. Variants include smoothing by bin means, smoothing by bin medians, and smoothing by bin boundaries; a sketch of binning is given below.

Regression: Data can be smoothed by fitting the data to a function, such as with regression. Linear regression involves finding the best line to fit two attributes (or variables), so that one attribute can be used to predict the other. Multiple linear regression is an extension of linear regression in which more than two attributes are involved and the data are fit to a multidimensional surface.
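Here is a small sketch of equal-depth binning with smoothing by bin means and by bin boundaries. The price values and the choice of three bins are illustrative assumptions, not taken from the text.

```python
import numpy as np

prices = np.sort(np.array([4.0, 8.0, 15.0, 21.0, 21.0, 24.0, 25.0, 28.0, 34.0]))
bins = np.array_split(prices, 3)        # equal-depth (equal-frequency) bins

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
smoothed_by_means = [np.full(len(b), b.mean()) for b in bins]

def smooth_by_boundaries(b):
    # Replace each value with the closer of the bin's minimum and maximum.
    lo, hi = b.min(), b.max()
    return np.where(b - lo <= hi - b, lo, hi)

smoothed_by_boundaries = [smooth_by_boundaries(b) for b in bins]

print(smoothed_by_means)
print(smoothed_by_boundaries)
```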

Clustering: Outliers may be detected by clustering, where similar values are organized into groups, or clusters. Intuitively, values that fall outside of the set of clusters may be considered outliers.
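A minimal sketch of clustering-based outlier detection follows. It uses scikit-learn's KMeans and treats values that end up in very small clusters as potential outliers; the data, the number of clusters, and the "fewer than two members" rule are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# One-dimensional values with two natural clusters and one isolated point.
values = np.array([[1.0], [1.2], [0.9], [1.1],
                   [5.0], [5.1], [4.9], [5.2],
                   [50.0]])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(values)
counts = np.bincount(km.labels_)

# Values falling into very small clusters (here, a singleton cluster) are
# flagged as potential outliers.
outlier_mask = counts[km.labels_] < 2
print(values[outlier_mask])            # expected to flag the isolated 50.0
```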

Data Integration
Data integration combines data from multiple sources into a coherent data store, as in data warehousing. These sources may include multiple databases, data cubes, or flat files. There are a number of issues to consider during data integration, including the entity identification problem and redundancy (an attribute may be redundant if it can be derived from another attribute or set of attributes).
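One common way to spot this kind of redundancy between numeric attributes is correlation analysis: an attribute that is (nearly) derivable from another will be highly correlated with it. A rough sketch, with hypothetical column names:

```python
import pandas as pd

df = pd.DataFrame({
    "price_usd":   [10.0, 25.0, 40.0, 55.0],
    "price_cents": [1000, 2500, 4000, 5500],   # derivable from price_usd
    "quantity":    [3, 1, 4, 2],
})

corr = df.corr()
# Correlation close to 1 suggests one of the two attributes is redundant.
print(corr.loc["price_usd", "price_cents"])
```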

Data Transformation
The data are transformed or consolidated into forms appropriate for mining. Data transformation can involve the following: smoothing, which works to remove noise from the data (techniques include binning, regression, and clustering); aggregation, where summary or aggregation operations are applied to the data; generalization of the data, where low-level or primitive (raw) data are replaced by higher-level concepts through the use of concept hierarchies; and normalization, where the attribute data are scaled so as to fall within a small specified range, such as -1.0 to 1.0 or 0.0 to 1.0. There are many methods for data normalization, including min-max normalization, z-score normalization, and normalization by decimal scaling. Min-max normalization performs a linear transformation on the original data. Suppose that min_A and max_A are the minimum and maximum values of an attribute A. Min-max normalization maps a value v of A to v' in the range [new_min_A, new_max_A] by computing

v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A

In z-score normalization (or zero-mean normalization), the values for an attribute A are normalized based on the mean and standard deviation of A. A value v of A is normalized to v' by computing

v' = (v - mean_A) / std_A

where mean_A and std_A are the mean and standard deviation of A.

Decimal scaling
In normalization by decimal scaling, the decimal point of the values of attribute A is moved. A value v of A is normalized to

v' = v / 10^j

where j is the smallest integer such that max(|v'|) < 1.
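A short sketch of the three normalization methods, using invented values for attribute A:

```python
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 986.0])   # invented values of A

# Min-max normalization into [new_min, new_max] = [0.0, 1.0].
new_min, new_max = 0.0, 1.0
v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Z-score (zero-mean) normalization.
v_zscore = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10**j, with j the smallest integer such that
# max(|v / 10**j|) < 1.
j = int(np.floor(np.log10(np.abs(v).max()))) + 1
v_decimal = v / 10 ** j

print(v_minmax)
print(v_zscore)
print(v_decimal)
```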

Attribute construction (or feature construction), where new attributes are constructed from the given set of attributes and added to help the mining process.
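For instance, a hypothetical "area" attribute could be constructed from existing "width" and "height" attributes; a minimal sketch (all names and values invented):

```python
import pandas as pd

df = pd.DataFrame({"width": [2.0, 3.5, 1.2], "height": [4.0, 2.0, 5.5]})
df["area"] = df["width"] * df["height"]   # constructed attribute
print(df)
```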

Data Reduction
Complex data analysis and mining on huge amounts of data can take a long time, making such analysis impractical or infeasible. Data reduction techniques can be applied to obtain a reduced representation of the data set. Strategies for data reduction include the following: data cube aggregation, where aggregation operations are applied to the data in the construction of a data cube; attribute subset selection, where irrelevant, weakly relevant, or redundant attributes or dimensions are detected and removed; dimensionality reduction, where encoding mechanisms are used to reduce the data set size; numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations such as parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods such as clustering, sampling, and the use of histograms; and discretization and concept hierarchy generation, where raw data values for attributes are replaced by ranges or higher conceptual levels.

Data cube aggregation
Data cubes provide fast access to precomputed, summarized data, thereby benefiting on-line analytical processing as well as data mining. The cube created at the lowest level of abstraction is referred to as the base cuboid. The base cuboid should correspond to an individual entity of interest, such as sales or customer. In other words, the lowest level should be usable, or useful for the analysis. For example, sales data recorded per quarter can be aggregated so that the resulting data summarize total sales per year; a sketch of this kind of aggregation follows.
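A minimal sketch of cube-style aggregation with pandas, using an invented quarterly sales table (all figures illustrative):

```python
import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2023, 2024, 2024, 2024, 2024],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "amount":  [224, 408, 350, 586, 310, 402, 390, 612],
})

# Aggregating away the quarter dimension yields a smaller, summarized table.
yearly = sales.groupby("year", as_index=False)["amount"].sum()
print(yearly)
```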

Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes (or dimensions). The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes. Mining on a reduced set of attributes has an additional benefit: it reduces the number of attributes appearing in the discovered patterns, helping to make the patterns easier to understand. Basic heuristic methods of attribute subset selection include the following techniques:

1. Stepwise forward selection: The procedure starts with an empty set of attributes as the reduced set. The best of the original attributes is determined and added to the reduced set. At each subsequent iteration or step, the best of the remaining original attributes is added to the set.

2. Stepwise backward elimination: The procedure starts with the full set of attributes. At each step, it removes the worst attribute remaining in the set.

3. Combination of forward selection and backward elimination: The stepwise forward selection and backward elimination methods can be combined so that, at each step, the procedure selects the best attribute and removes the worst from among the remaining attributes.

4. Decision tree induction: Decision tree algorithms, such as ID3, C4.5, and CART, were originally intended for classification. Decision tree induction constructs a flowchart-like structure in which each internal (nonleaf) node denotes a test on an attribute, each branch corresponds to an outcome of the test, and each external (leaf) node denotes a class prediction. At each node, the algorithm chooses the best attribute to partition the data into individual classes. When decision tree induction is used for attribute subset selection, a tree is constructed from the given data. All attributes that do not appear in the tree are assumed to be irrelevant. The set of attributes appearing in the tree form the reduced subset of attributes.
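As a rough illustration of item 1, the following sketch runs stepwise forward selection greedily, using a decision tree and cross-validated accuracy as the evaluation criterion. The dataset, the scoring choice, and the target of two attributes are assumptions made for the example, not prescribed by the text.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
remaining = list(range(X.shape[1]))   # candidate attribute indices
selected = []                         # the reduced attribute set, initially empty

def cv_score(columns):
    # Evaluate a candidate attribute subset with a simple cross-validated tree.
    clf = DecisionTreeClassifier(random_state=0)
    return cross_val_score(clf, X[:, columns], y, cv=5).mean()

# At each step, add whichever remaining attribute most improves the score;
# here we stop after two attributes have been chosen.
for _ in range(2):
    best = max(remaining, key=lambda c: cv_score(selected + [c]))
    selected.append(best)
    remaining.remove(best)

print("selected attribute indices:", selected)
```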

In dimensionality reduction, data encoding or transformations are applied so as to obtain a reduced or compressed representation of the original data. If the original data can be reconstructed from the compressed data without any loss of information, the data reduction is called lossless. If, instead, we can reconstruct only an approximation of the original data, then the data reduction is called lossy. Two popular and effective methods of lossy dimensionality reduction are wavelet transforms and principal components analysis.
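A minimal sketch of lossy dimensionality reduction with principal components analysis (PCA), using synthetic data that happens to lie in a two-dimensional subspace, so the reconstruction error is close to zero; the data and the choice of two components are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
# Four attributes that are linear combinations of two hidden factors.
X = np.hstack([base, base @ rng.normal(size=(2, 2))])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)              # compressed representation
X_approx = pca.inverse_transform(X_reduced)   # lossy reconstruction

print(X.shape, "->", X_reduced.shape)
print("mean squared reconstruction error:", np.mean((X - X_approx) ** 2))
```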
