
Introduction to WEKA Data Mining

WEKA - what is it?

Integration with Pentaho
Projects based on WEKA
Data Mining
A definition: Extraction of implicit, previously unknown, and potentially
useful information from data
Goal (business oriented): improve marketing, sales, and customer support
Who is likely to remain a loyal customer/jump ship?
What products should be marketed to which prospects?
What determines whether a person will respond to a certain offer?
How can I detect potential fraud?
Central idea: historical data contains information that will be useful in the future
Historical patterns provide useful insight and generalize to future situations
Data Mining: algorithms that automatically detect patterns and regularities
in data
Data Mining
Strong patterns can be used to make predictions
Problem 1: a lot of patterns are not interesting
Problem 2: patterns may be inexact (or even completely spurious) if data is garbled
or missing
Techniques borrowed from statistics, computer science, machine learning
Compared to traditional statistics
Statistics is manual, user driven, top-down - formulate a hypothesis, convert
hypothesis into database query, test significance of results
Data mining automates the data interrogation
Data-driven, self-organizing, bottom-up
Automatic examination of a large number of hypotheses
Compared to OLAP
OLAP: data summarization - aggregation via addition
Number of widgets sold in all ZIP codes in the country
Data Mining: ratios, patterns and influences
Factors influencing the sales of the widgets in those ZIP codes
DM can enhance OLAP - suggest dimensions for cube, discretization etc.
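
The contrast between the two can be sketched on a toy data set. This is only an illustration: the sales records, the `promoted` factor and both function names are hypothetical and do not come from WEKA or any OLAP product. The first function sums units per ZIP code (aggregation via addition); the second asks how a factor relates to sales.

```python
from collections import defaultdict

# Hypothetical sales records: (zip_code, promoted, units_sold)
SALES = [
    ("10001", True, 30),
    ("10001", False, 10),
    ("94105", True, 24),
    ("94105", False, 12),
]

def total_units_by_zip(records):
    """OLAP-style summary: add up units sold per ZIP code."""
    totals = defaultdict(int)
    for zip_code, _promoted, units in records:
        totals[zip_code] += units
    return dict(totals)

def mean_units_by_factor(records):
    """Data-mining-style question: how do sales differ by a factor?"""
    sums, counts = defaultdict(int), defaultdict(int)
    for _zip_code, promoted, units in records:
        sums[promoted] += units
        counts[promoted] += 1
    return {factor: sums[factor] / counts[factor] for factor in sums}

totals = total_units_by_zip(SALES)
means = mean_units_by_factor(SALES)
# Ratio of average sales with promotion to average sales without it.
promotion_ratio = means[True] / means[False]
```

A real data mining tool would search over many candidate factors automatically rather than test one chosen by hand.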
Data Mining is a Process

[Figure: the data mining process as a pipeline. Steps: Select, Preprocess, Transform, Mine, Analyze &. Intermediate products: selected data, preprocessed data, extracted information, assimilated knowledge.]

What is WEKA?

Copyright: Martin Kramer (mkramer@wxs.nl)

WEKA: The Software
WEKA (Waikato Environment for Knowledge Analysis)

Funded by the NZ government for more than 10 years

Develop an open-source state-of-the-art workbench of data mining tools

Explore fielded applications
Develop new fundamental methods

Became part of the Pentaho suite in 2006

Pentaho Data Mining (PDM)
Core Functionality
Support for the whole process of experimental data mining
Preparation of input data
Statistical evaluation of learning schemes
Visualization of input data and the result of learning

Tools and algorithms

69 data pre-processing tools
118 classification/regression algorithms
11 clustering algorithms
18 attribute/subset evaluators + 12 search algorithms for feature selection
6 algorithms for finding association rules
User Interfaces
Explorer - data exploration/visualization, model construction and export,
preliminary evaluation
Experimenter - large-scale algorithm comparison with statistical tests for
significant differences in performance
KnowledgeFlow - process model view of data mining, export of DM process
Modular, object-oriented architecture
Packages for different types of algorithms: filters, classifiers, clusterers, associations,
attribute selection etc.
Sub-packages group components by functionality or purpose
E.g. classifiers.bayes, filters.unsupervised.attribute

Public interface prescribed by abstract base class or interface for all types of algorithms
Algorithms are Java Beans
GUIs use introspection/reflection to dynamically generate editor dialogs at runtime

All components rely to a greater or lesser extent on a core top-level package

Classes and data structures for reading data sources; representing instances, sparse
instances and attributes; and providing common utility methods
Additional interfaces that indicate extra functionality

Packages containing learning schemes have associated Evaluation classes

Routines for performing cross-validation, computing performance metrics, generating
ROC curves etc.
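
What a cross-validation routine does can be sketched in a few lines. This is a minimal sketch under stated assumptions, not WEKA's Evaluation code: the function names are hypothetical, folds are formed by taking every k-th instance (no shuffling or stratification), and the learner is a trivial majority-class baseline.

```python
from collections import Counter

def k_fold_indices(n_instances, k):
    """Yield (train, test) index lists; fold i holds every k-th instance."""
    indices = list(range(n_instances))
    for fold in range(k):
        test = indices[fold::k]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test

def majority_class(labels):
    """Baseline learner: always predict the most frequent training label."""
    return Counter(labels).most_common(1)[0][0]

def cross_validated_accuracy(labels, k):
    """Pool correct predictions over all k test folds."""
    correct = 0
    for train, test in k_fold_indices(len(labels), k):
        prediction = majority_class([labels[i] for i in train])
        correct += sum(1 for i in test if labels[i] == prediction)
    return correct / len(labels)

# Hypothetical class labels for ten instances.
LABELS = ["yes", "yes", "no", "yes", "no", "yes", "yes", "no", "yes", "yes"]
accuracy = cross_validated_accuracy(LABELS, k=5)
```

Each instance is used for testing exactly once, so the pooled accuracy is an estimate of performance on unseen data.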
Preprocess panel
Load data from various sources (file, SQL database, URL etc.)
Apply pre-processing filters to the data
Summary statistics & histograms
Classify panel
Apply classification and regression algorithms
Evaluate resulting models
Numerically via statistical estimation
Graphically through visualization (data and model)
Cluster panel
Apply clustering algorithms to the data
Visualize the outcome
Clusters that represent density estimates can be evaluated based on the
statistical likelihood of the data
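
Likelihood-based evaluation of a clustering can be illustrated as follows. This is a sketch assuming a one-dimensional Gaussian mixture with hand-picked parameters; the function names and data are hypothetical, and WEKA's own clusterers estimate such parameters rather than take them as given.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def mixture_log_likelihood(data, components):
    """Sum of log densities of the data under a mixture model.

    components: list of (weight, mean, std); weights should sum to 1.
    """
    total = 0.0
    for x in data:
        density = sum(w * gaussian_pdf(x, m, s) for w, m, s in components)
        total += math.log(density)
    return total

# Hypothetical data with two visible groups, near 1 and near 5.
DATA = [0.9, 1.1, 1.0, 5.0, 5.2, 4.8]
one_cluster = [(1.0, 3.0, 2.0)]
two_clusters = [(0.5, 1.0, 0.2), (0.5, 5.0, 0.2)]
ll_one = mixture_log_likelihood(DATA, one_cluster)
ll_two = mixture_log_likelihood(DATA, two_clusters)
```

The two-cluster model assigns the data a higher log-likelihood here. In practice the comparison is more trustworthy on held-out data, since likelihood on the training data tends to favor models with more clusters.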
Associate panel
Learn association rules for market-basket type analysis
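
The two standard measures behind market-basket rules, support and confidence, can be sketched directly. The baskets and function names below are hypothetical, and this shows only how a single rule is scored, not how an association rule learner searches for rules.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= set(basket))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions.

    Assumes the antecedent occurs at least once; otherwise this divides by zero.
    """
    joint = support(transactions, set(antecedent) | set(consequent))
    return joint / support(transactions, antecedent)

# Hypothetical shopping baskets.
BASKETS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
rule_support = support(BASKETS, {"bread", "milk"})          # rule: bread => milk
rule_confidence = confidence(BASKETS, {"bread"}, {"milk"})
```

Two of the four baskets contain both items, and two of the three baskets with bread also contain milk.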

Select attributes panel

Mix and match algorithms for evaluating the utility of attributes
and sets of attributes with different search methods
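
The split between an attribute evaluator and a search method can be sketched as below. This assumes information gain as the evaluator and a plain ranking as the "search"; the names and the tiny data set are hypothetical, and the code is not taken from WEKA.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Evaluator: drop in class entropy after splitting on a nominal attribute."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[value].append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def rank_attributes(columns, labels):
    """Search: score every attribute on its own and sort by score, best first."""
    scores = {name: information_gain(col, labels) for name, col in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical nominal attributes and class labels for four instances.
COLUMNS = {
    "outlook": ["sunny", "sunny", "rainy", "rainy"],
    "windy": ["yes", "no", "yes", "no"],
}
PLAY = ["no", "no", "yes", "yes"]
ranking = rank_attributes(COLUMNS, PLAY)
```

Because the evaluator and the search are separate functions, either could be swapped out, for example a subset evaluator paired with a greedy forward search.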
Visualize panel
Color-coded scatter plot matrix of the data
Select, enlarge, zoom in etc.
Knowledge Flow

Define a data mining process

Like the Explorer, all of WEKA's algorithms are available
Data flows through the process from node to node
Accommodates both batch-based processing and data streams
Command line interface to WEKA can also train incremental
classifiers on data streams
Fully multi-threaded
Accommodates multiple independent flows on the same layout
Knowledge Flow's Classifier step is multi-threaded
Build models for more than one cross-validation fold in parallel
Binary and XML-based persistence of flow layouts
Experimenter
Automate the process of determining the best method to
This is an interactive process in the Explorer or Knowledge Flow
Run classification and regression algorithms on a corpus of
data sets
Try different parameter settings
Collect performance statistics
Perform significance tests on the results
Raw output saved to files or databases
Analysis results can be exported as text, CSV, Gnuplot, LaTeX
Advanced users can distribute the processing load across
multiple machines
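
The idea of a significance test on collected performance statistics can be illustrated with a plain paired t statistic. This is a sketch only: the accuracies are hypothetical, the function name is an assumption, and the tests built into the Experimenter may differ (for example, by correcting for overlapping training sets).

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """t statistic for the mean of paired score differences.

    Assumes the differences are not all identical; otherwise the
    standard deviation is zero and this divides by zero.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation
    return mean_diff / (std_diff / math.sqrt(len(diffs)))

# Hypothetical accuracies of two schemes on the same five data sets.
SCHEME_A = [0.90, 0.85, 0.88, 0.92, 0.80]
SCHEME_B = [0.86, 0.84, 0.83, 0.90, 0.78]
t_value = paired_t_statistic(SCHEME_A, SCHEME_B)
```

The resulting value would then be compared against a t distribution with n - 1 degrees of freedom (here 4) to decide whether the difference is significant.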

Plugin mechanisms allow WEKA to be extended without modifying the classes in the WEKA distribution
New tabs can be added to the Explorer
New visualizations can be added to the pop-up menu in
the Explorer's Classify panel
Classifier errors, predictions, trees and graphs
Knowledge Flow - Plugins tab
Drop a jar file into $HOME/.knowledgeFlow/plugins/<a plugin>/
Standards and Interoperability

Support for PMML import

Regression, general regression and
neural network model types
More model types and support for
export in future development
LibSVM/SVM-Light data format
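
For readers unfamiliar with the LibSVM data format, each line holds a label followed by sparse `index:value` pairs with 1-based indices. The helper functions below are hypothetical and stand in for a proper reader or writer; they sketch the layout only and do not handle comments or the extra fields some SVM-Light files carry.

```python
def to_libsvm_line(label, features):
    """Format one instance as 'label index:value ...' with 1-based indices.

    Zero-valued features are left out, which is what makes the format sparse.
    """
    pairs = [f"{i}:{v}" for i, v in enumerate(features, start=1) if v != 0]
    return " ".join([str(label)] + pairs)

def parse_libsvm_line(line, n_features):
    """Inverse of to_libsvm_line: rebuild the dense feature list."""
    label, *pairs = line.split()
    features = [0.0] * n_features
    for pair in pairs:
        index, value = pair.split(":")
        features[int(index) - 1] = float(value)
    return float(label), features

line = to_libsvm_line(1, [0.5, 0.0, 2.0])
restored = parse_libsvm_line(line, n_features=3)
```

The number of features is not stored in the file, so the parser has to be told how wide the dense vector should be.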
Integration With Pentaho
Main point of integration is with Pentaho Data Integration (PDI), aka
the Kettle project
PDI (Kettle) - streaming, engine-driven ETL tool
PDI can export data in ARFF format
High-volume, low memory consumption
WEKA-specific transformation steps
WekaScoring: score data using a pre-constructed WEKA model
(classification, regression or clustering) or PMML model as part of an
ETL transformation
KnowledgeFlow: execute arbitrary Knowledge Flow processes as part
of an ETL transformation
Can be used to automatically refresh/rebuild a predictive model
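
The ARFF format mentioned above is plain text: a header declaring the relation and its attributes, followed by comma-separated data rows. The writer below is a minimal sketch with a hypothetical function name and data. It covers only numeric and nominal attributes and does not quote names containing spaces or handle missing values.

```python
def to_arff(relation, attributes, rows):
    """Render a small dense data set as ARFF text.

    attributes: list of (name, kind) where kind is the string "numeric"
    or a list of nominal values. rows: value lists in attribute order.
    """
    lines = ["@relation " + relation, ""]
    for name, kind in attributes:
        if isinstance(kind, str):
            lines.append("@attribute " + name + " " + kind)
        else:
            # Nominal attributes list their allowed values in braces.
            lines.append("@attribute " + name + " {" + ",".join(kind) + "}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"

# Hypothetical two-attribute data set.
arff_text = to_arff(
    "weather",
    [("temperature", "numeric"), ("play", ["yes", "no"])],
    [[85, "no"], [70, "yes"]],
)
```

A file with this content can be loaded in the Explorer's Preprocess panel like any other ARFF source.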
Scoring as Part of an ETL process
Refreshing a predictive model
Projects Based on/Using WEKA
Open-source data mining systems
Konstanz Information Miner (KNIME) & RapidMiner provide WEKA plugins
R provides an interface to WEKA through the RWeka package
Scientific workflow environment
Kepler Weka project integrates all of WEKA's functionality into the Kepler
open-source scientific workflow platform
Systems for natural language processing
GATE NLP workbench
Balie - language identification, tokenization, sentence boundary
detection, named entity recognition
Kea - automatic keyphrase extraction
Text mining
Judge & IR Utilities - two systems that perform document categorization
and clustering
Projects Based on/Using WEKA
Knowledge discovery in biology
BioWEKA - extension to WEKA for tasks in biology, bioinformatics and
Epitopes Toolkit - platform based on WEKA for developing epitope
prediction tools
maxdView & Mayday - visualization and analysis of microarray data
Distributed and parallel data mining
Weka-Parallel & GridWeka - distributed cross-validation, scoring and
FAEHIM & Weka4WS - make WEKA available as a web service
Connectionist and artificial immune system algorithms
Weka Classification Algorithms Project - several artificial neural networks
and artificial immune system based algorithms

Has been downloaded more than 1.5 million times since placed on
SourceForge in April 2000
Current download rate of more than 20,000 per month
Large user-base and active community
2750 people subscribed to the mailing list
