
Comparison of Different Data Mining Tools Used Globally

Submitted by: Pranesh P, 135, Section B

Data mining tools comparison - Summary

KNIME is very easy to use and is good for preprocessing datasets and descriptors. Personally, among the various packages, I enjoy using KNIME the most. It is a pity that it is weaker in model building and validation. Hopefully the next major version of KNIME will address these issues.

RapidMiner has a very large set of operators, which makes it well suited to comparing different machine learning and statistical methods. It is also very good for model building and validation. However, its learning curve is rather steep.

Weka (KnowledgeFlow) sits somewhere between KNIME and RapidMiner. Like RapidMiner, it has quite a large number of components, and like KNIME, it is relatively simple to use. However, it cannot perform all the functions that are available in RapidMiner, and its graphical user interface is not as friendly as KNIME's.

TANAGRA is similar to RapidMiner in the layout it uses to represent an experimental procedure. However, it has significantly fewer operators than RapidMiner. My initial impression was that it should be quite good for performing QSAR (quantitative structure-activity relationship) experiments. After using it, however, it seems to lack several important features.

Orange is similar to Weka (KnowledgeFlow) in terms of layout. However, like TANAGRA, it seems to lack some important features for QSAR experiments.

A feature missing from all of these packages is parallel computing, either by distributing jobs among different computers on a network or by using all the cores of a multi-core CPU.

Table 1 shows a comparison of the five packages for performing procedures that are widely used in QSAR experiments. The best package appears to be RapidMiner. At first glance, Weka seems redundant, since RapidMiner has incorporated most of its algorithms. However, it still contains some algorithms, especially for descriptor selection, that are not available in the other packages. Although TANAGRA and Orange are the worst performing of the five, they do have their own merits. For instance, TANAGRA has an interesting collection of statistical tests, while Orange has some interesting prototypes such as the MeSH Term Browser. Personally, I will invest my time in learning KNIME, RapidMiner, and Weka well, and will use these three for my future research work.

Table 1: Comparison of the five software packages for performing procedures that are widely used in QSAR experiments.

Procedure: Partitioning of dataset into training and testing sets
KNIME: Pass (but limited partitioning methods)
RapidMiner: Pass (but limited partitioning methods)
Weka: Pass (but limited partitioning methods)
TANAGRA: Pass (but limited partitioning methods)
Orange: Pass (but limited partitioning methods)

Procedure: Descriptor scaling
KNIME: Pass
RapidMiner: Pass
Weka: Fail (cannot save parameters for scaling to apply to future datasets)
TANAGRA: Fail (cannot save parameters for scaling to apply to future datasets)
Orange: Fail (no scaling methods)

Procedure: Descriptor selection
KNIME: Fail (no wrapper methods)
RapidMiner: Pass
Weka: Pass (but is not part of KnowledgeFlow)
TANAGRA: Fail (wrapper methods valid for logistic regression only)
Orange: Fail (no wrapper methods)

Procedure: Parameter optimization of machine learning/statistical methods
KNIME: Fail (not automatic)
RapidMiner: Pass
Weka: Fail (not automatic)
TANAGRA: Fail (not automatic)
Orange: Fail (not automatic)

Procedure: Model validation using cross-validation and/or independent validation set
KNIME: Pass (but limited error measurement methods)
RapidMiner: Pass
Weka: Pass (but cannot save model so have to rebuild model for every future dataset)
TANAGRA: Pass (but cannot save model so have to rebuild model for every future dataset)
Orange: Fail (cannot validate independent validation set)
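Two of the procedures in Table 1, partitioning a dataset and scaling descriptors, can be illustrated with a few lines of plain Python. The sketch below only shows what "saving parameters for scaling to apply to future datasets" means in practice; it is not how any of the five packages implements it. The function names, the toy descriptor matrix, the choice of z-score scaling, and the use of JSON for storage are all assumptions made for illustration.

```python
import json
import random
import statistics


def split_train_test(rows, test_fraction=0.25, seed=0):
    """Randomly partition rows into a training set and a testing set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]


def fit_scaler(rows):
    """Compute per-descriptor mean and standard deviation from training rows."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    # Fall back to 1.0 for constant descriptors to avoid division by zero.
    stdevs = [statistics.pstdev(col) or 1.0 for col in columns]
    return {"mean": means, "stdev": stdevs}


def apply_scaler(params, rows):
    """Scale rows with previously fitted parameters (no refitting)."""
    return [
        [(x - m) / s for x, m, s in zip(row, params["mean"], params["stdev"])]
        for row in rows
    ]


# Hypothetical descriptor matrix: one row per compound, one column per descriptor.
data = [[1.0, 200.0], [2.0, 220.0], [3.0, 260.0], [4.0, 300.0],
        [5.0, 310.0], [6.0, 350.0], [7.0, 390.0], [8.0, 400.0]]

train, test = split_train_test(data, test_fraction=0.25, seed=42)
params = fit_scaler(train)

# Storing the fitted parameters is what allows a future dataset
# to be scaled in exactly the same way as the training set.
saved = json.dumps(params)
restored = json.loads(saved)

train_scaled = apply_scaler(restored, train)
test_scaled = apply_scaler(restored, test)
```

The scaling parameters are fitted on the training set only and then reused unchanged for the testing set. A tool that cannot store these parameters would have to refit them on every new dataset, and the new data would then be scaled differently from the data the model was built on.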

Lastly, I need to reiterate that the above comments and all the previous posts on these packages are very subjective. They are subjective because I have a vested interest in QSAR-type modeling and also because I am not very familiar with these packages (I have never used them in any of my research projects). Thus there may be factual inaccuracies in my review (i.e. a procedure that I stated a particular package cannot perform may in fact be supported). The authors of these packages, or readers who are experienced with them, are welcome to comment on these factual inaccuracies, and I will update the posts accordingly.

KNIME

KNIME is a widely used open source graphical workbench for data mining, visualisation and reporting, used by over 3000 organisations. KNIME Desktop is the entry-level open source version (the paid-for versions are for organisations that need support and additional features). It is based on the well-regarded and widely used Eclipse IDE platform, which makes it as much a development platform (for bespoke extensions) as a data mining platform. It incorporates hundreds of different nodes for data I/O, preprocessing, cleansing, modelling, analysis and data mining. WEKA analysis modules are also incorporated, and an additional plugin allows R scripts to be run. KNIME runs on Windows, Mac OS X and Linux.

Benefits and Requirements

KNIME Trusted Partner

Your company should meet these basic business requirements to be eligible for the KNIME Trusted Partner Program:

- Established business with strong financial health
- Analytic, data mining and/or BI consulting and implementation experience
- Ability to commit to and deliver on a revenue target
- Availability of dedicated sales and marketing resources
- Strong fit between KNIME professional products and Partner's services
- Strong overlap between KNIME's and Partner's target audience or customer base
- Ability to establish and maintain a pipeline of customers and generate a steady flow of leads
- Customer references
- A minimum of one sales and one technical representative

TANAGRA

TANAGRA is free data mining software for academic and research purposes. It offers several data mining methods from the areas of exploratory data analysis, statistical learning, machine learning and databases. The project is the successor of SIPINA, which implements various supervised learning algorithms, in particular the interactive and visual construction of decision trees. TANAGRA is more powerful: it contains some supervised learning methods but also other paradigms such as clustering, factorial analysis, parametric and nonparametric statistics, association rules, and feature selection and construction algorithms. TANAGRA is an "open source project" in that every researcher can access the source code and add their own algorithms, provided they agree to and comply with the software distribution license.

The main purpose of the TANAGRA project is to give researchers and students easy-to-use data mining software that conforms to the present norms of software development in this domain (especially in the design of its GUI and the way it is used), and that allows them to analyse either real or synthetic data.

The second purpose of TANAGRA is to offer researchers an architecture that lets them easily add their own data mining methods and compare their performance. In this respect TANAGRA acts as an experimental platform that lets them concentrate on the essentials of their work and spares them the unpleasant part of programming this kind of tool: the data management.

The third and last purpose, aimed at novice developers, is to share a possible methodology for building this kind of software. They can take advantage of free access to the source code to see how this sort of software is built, which problems to avoid, what the main steps of the project are, and which tools and code libraries to use. In this way, TANAGRA can be considered a pedagogical tool for learning programming techniques.

TANAGRA does not presently include what makes up the strength of commercial software in this domain: a wide set of data sources, direct access to data warehouses and databases, data cleansing, interactive use, and so on.

Orange

This is a very capable open source visualisation and analysis tool with an easy-to-use interface. Most analyses can be carried out through its visual programming interface (drag and drop of widgets), and most visual tools are supported, including scatterplots, bar charts, trees, dendrograms and heatmaps. A large number of widgets (over 100) are supported. These cover data transformation, classification, regression, association, visualisation and unsupervised learning methods. There are also some specialised add-ons covering bioinformatics, text mining and other specialist requirements. The environment is extensible through Python scripting, and this includes creating new widgets if needed. The documentation is good too and covers first steps, detailed widget descriptions and scripting. It runs on Windows, Mac OS X and Linux.

R

Strictly speaking R is a programming language, but there are literally thousands of libraries that can be incorporated into the R environment, making it a powerful data mining environment. In reality R is probably the most flexible and powerful data mining environment available, but it does require a high level of skill. From a career perspective, learning R is a good investment. Many enterprise tools support R (SAP Predictive Analysis and Tibco Spotfire, for example), and it addresses much more than data mining. Revolution Analytics has based its products on R and has added a graphical front-end. It also offers a free version of R that is claimed to be faster than the general distribution.

RapidMiner

This is perhaps the most widely used open source data mining platform (with over 3 million downloads). It incorporates analytical ETL (Extract, Transform and Load), data mining and predictive reporting. The graphical user interface and visualisation tools are excellent, with considerable intelligence built into the workflow construction process. This provides on-the-fly error recognition and suggested quick fixes. Its metadata transformation capability is unique among tools of this nature, allowing results to be inspected at design time. It incorporates over 500 operators and includes the WEKA machine learning library. Many extensions are available for the analysis of time series and text and for other specialised processes. Most data sources are supported, including Excel, Access, Oracle, IBM DB2, Microsoft SQL Server, Sybase, Ingres, MySQL, text files and others. Rapid-I provides support and training services for organisations that want a supported product.

WEKA

This set of data mining tools is incorporated into many other products (KNIME and RapidMiner, for example), but it is also a stand-alone platform for many data mining tasks, including preprocessing, clustering, regression, classification and visualisation. Support for data sources is extended through Java Database Connectivity, but the default format for data is the flat file. WEKA comes from the highly respected machine learning group at the University of Waikato, New Zealand (the same origin as the 11AntsAnalytics Excel data mining tool). Models can be built using a graphical user interface or command line input.
