NAME
weka.associations.Apriori

SYNOPSIS
Finds association rules.

OPTIONS
car -- If enabled, class association rules are mined instead of (general) association rules.

classIndex -- Index of the class attribute. If set to -1, the last attribute is taken as the class attribute.

delta -- Iteratively decrease support by this factor. Reduces support until min support is reached or the required number of rules has been generated.

lowerBoundMinSupport -- Lower bound for minimum support.

metricType -- Set the type of metric by which to rank rules. Confidence is the proportion of the examples covered by the premise that are also covered by the consequence (class association rules can only be mined using confidence). Lift is confidence divided by the proportion of all examples that are covered by the consequence. This is a measure of the importance of the association that is independent of support. Leverage is the proportion of additional examples covered by both the premise and consequence above those expected if the premise and consequence were independent of each other. The total number of examples that this represents is presented in brackets following the leverage. Conviction is another measure of departure from independence. Conviction is given by

minMetric -- Minimum metric score. Consider only rules with scores higher than this value.

numRules -- Number of rules to find.

removeAllMissingCols -- Remove columns with all missing values.

significanceLevel -- Significance level. Significance test (confidence metric only).

upperBoundMinSupport -- Upper bound for minimum support. Start iteratively decreasing minimum support from this value.
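The confidence, lift, and leverage definitions under metricType can be worked through on a small example. The following is a minimal sketch in plain Python, not Weka's implementation; the function name `rule_metrics` and the toy transactions are hypothetical, introduced only for illustration. Conviction is left out because its formula is cut off in the text above.

```python
def rule_metrics(transactions, premise, consequence):
    """Score the rule premise -> consequence using the definitions above.

    Hypothetical helper for illustration; this is not part of Weka.
    """
    premise, consequence = set(premise), set(consequence)
    n = len(transactions)
    # Count the examples covered by the premise, the consequence, and both.
    n_premise = sum(1 for t in transactions if premise <= set(t))
    n_consequence = sum(1 for t in transactions if consequence <= set(t))
    n_both = sum(1 for t in transactions if (premise | consequence) <= set(t))
    if n == 0 or n_premise == 0 or n_consequence == 0:
        raise ValueError("premise and consequence must each cover at least one example")

    support = n_both / n
    # Confidence: share of premise-covered examples also covered by the consequence.
    confidence = n_both / n_premise
    # Lift: confidence divided by the share of all examples covered by the consequence.
    lift = confidence / (n_consequence / n)
    # Leverage: observed joint share minus the share expected under independence.
    leverage = support - (n_premise / n) * (n_consequence / n)
    return {
        "support": support,
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        # Number of examples the leverage represents (the bracketed figure).
        "leverage_count": leverage * n,
    }


# Hypothetical toy data: five shopping baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

metrics = rule_metrics(transactions, {"bread"}, {"milk"})
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

On this toy data the rule bread -> milk has a confidence of 0.75, a lift of about 0.94, and a leverage of about -0.04. A lift below 1 and a negative leverage both indicate that bread and milk occur together slightly less often than independence would predict.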

Classification is a data mining function that assigns items in a collection to target categories or classes. The goal of classification is to accurately predict the target class for each case in the data. For example, a classification model could be used to identify loan applicants as low, medium, or high credit risks.

A classification task begins with a data set in which the class assignments are known. For example, a classification model that predicts credit risk could be developed based on observed data for many loan applicants over a period of time. In addition to the historical credit rating, the data might track employment history, home ownership or rental, years of residence, number and type of investments, and so on. Credit rating would be the target, the other attributes would be the predictors, and the data for each customer would constitute a case.

Classifications are discrete and do not imply order. Continuous, floating-point values would indicate a numerical, rather than a categorical, target. A predictive model with a numerical target uses a regression algorithm, not a classification algorithm.

The simplest type of classification problem is binary classification. In binary classification, the target attribute has only two possible values: for example, high credit rating or low credit rating. Multiclass targets have more than two values: for example, low, medium, high, or unknown credit rating.

In the model build (training) process, a classification algorithm finds relationships between the values of the predictors and the values of the target. Different classification algorithms use different techniques for finding relationships. These relationships are summarized in a model, which can then be applied to a different data set in which the class assignments are unknown.
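The build-then-apply workflow described above can be made concrete with a small decision tree, one common family of classification algorithms. This is a minimal sketch under stated assumptions: the attribute at each node is chosen by lowest weighted entropy (equivalently, highest information gain), which is one common criterion and not something these notes prescribe. Pruning is omitted, and the attribute names (`owns_home`, `employed`, `risk`) and the four training cases are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def build_tree(cases, target, predictors):
    """Model build: recursively partition the training cases."""
    labels = [case[target] for case in cases]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the node is pure or there is nothing left to split on.
    if len(set(labels)) == 1 or not predictors:
        return majority  # leaf node: a class label

    def split_entropy(attr):
        # Weighted entropy of the target after partitioning on attr.
        groups = {}
        for case in cases:
            groups.setdefault(case[attr], []).append(case[target])
        return sum(len(g) / len(cases) * entropy(g) for g in groups.values())

    # Assumed selection criterion: the attribute giving the purest partitions.
    best = min(predictors, key=split_entropy)
    remaining = [p for p in predictors if p != best]
    branches = {}
    for value in {case[best] for case in cases}:
        subset = [case for case in cases if case[best] == value]
        branches[value] = build_tree(subset, target, remaining)
    # Internal node: a test on one attribute, with one branch per outcome.
    return {"attr": best, "branches": branches, "default": majority}


def predict(tree, case):
    """Model apply: follow the tests from the root down to a leaf."""
    while isinstance(tree, dict):
        # Attribute values not seen in training fall back to the node's majority class.
        tree = tree["branches"].get(case[tree["attr"]], tree["default"])
    return tree


# Hypothetical training cases with known class assignments.
training = [
    {"owns_home": "yes", "employed": "yes", "risk": "low"},
    {"owns_home": "yes", "employed": "no", "risk": "low"},
    {"owns_home": "no", "employed": "yes", "risk": "medium"},
    {"owns_home": "no", "employed": "no", "risk": "high"},
]

tree = build_tree(training, target="risk", predictors=["owns_home", "employed"])

# Apply the model to a case whose class is unknown.
print(predict(tree, {"owns_home": "no", "employed": "yes"}))  # prints "medium"
```

Here `risk` plays the role of the target, the other two attributes are the predictors, and each dictionary is one case. The target is multiclass (low, medium, high); a binary target would work the same way.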
1. Decision tree
   a. A flow-chart-like tree structure
   b. Internal node denotes a test on an attribute
   c. Branch represents an outcome of the test
   d. Leaf nodes represent class labels or class distribution
2. Decision tree generation consists of two phases
   a. Tree construction
      i. At start, all the training examples are at the root
      ii. Partition examples recursively based on selected attributes
   b. Tree pruning
      i. Identify and remove branches that reflect noise or outliers

Clustering

The basic procedure of k-means clustering is simple. First, we choose the number of clusters K and pick initial centroids (centers) for these clusters. Any K randomly chosen objects can serve as the initial centroids, or simply the first K objects in sequence. The k-means algorithm then repeats the three steps below until it converges, that is, until the grouping is stable and no object moves to a different group:
1. Determine the centroid coordinates
2. Determine the distance of each object to the centroids
3. Group the objects based on minimum distance
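The loop above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: it assumes Euclidean distance (the notes do not name a distance measure), uses the first K objects as the initial centroids, and adds a hypothetical `max_iter` safety cap. The four sample points are made up for the example.

```python
import math


def kmeans(points, k, max_iter=100):
    """Cluster points into k groups; returns (centroids, assignment)."""
    # Initial centroids: the first K objects in sequence.
    centroids = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iter):
        # Steps 2 and 3: measure the distance from each object to every
        # centroid and put the object in the group of the nearest one.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Stable: no object moved to a different group, so stop.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 1 (for the next pass): each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a group ends up empty
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignment


# Hypothetical data: four objects described by two attributes each.
points = [(1, 1), (2, 1), (4, 3), (5, 4)]
centroids, assignment = kmeans(points, k=2)
print(assignment)  # prints [0, 0, 1, 1]
print(centroids)   # prints [(1.5, 1.0), (4.5, 3.5)]
```

On this data the second object starts as its own centroid, joins the first group on the second pass once the centroids have been recomputed, and the grouping is stable on the third pass.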